perm filename V2ANS2.TEX[TEX,DEK]6 blob
sn#506773 filedate 1980-05-03 generic text, type C, neo UTF8
COMMENT ⊗ VALID 00014 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00003 00002 \input acphdr % Answer pages (double-check position of figures)
C00004 00003 %folio 723 galley 10b (C) Addison-Wesley 1978 *
C00008 00004 %folio 725 galley 1 (C) Addison-Wesley 1978 *
C00027 00005 %folio 731 galley 2a (C) Addison-Wesley 1978 *
C00044 00006 %folio 733 galley 2b (C) Addison-Wesley 1978 *
C00049 00007 %folio 735 galley 3a (C) Addison-Wesley 1978 *
C00055 00008 %folio 738 galley 3b (C) Addison-Wesley 1978 *
C00067 00009 %folio 742 galley 4a (C) Addison-Wesley 1978 *
C00080 00010 %folio 745 galley 4b (C) Addison-Wesley 1978 *
C00089 00011 %folio 748 galley 5 (C) Addison-Wesley 1978 *
C00108 00012 %folio 754 galley 6 (C) Addison-Wesley 1978 *
C00122 00013 %folio 758 galley 7a (C) Addison-Wesley 1978 *
C00133 00014 \eject % assumes that page break at this point works OK, so file v2ans1 not too long
C00134 ENDMK
C⊗;
\input acphdr % Answer pages (double-check position of figures)
\open0=v2ans2.inx
\runninglefthead{ANSWERS TO EXERCISES}
\titlepage\setcount00
\null
\vfill
\tenpoint
\ctrline{ANSWER PAGES for THE ART OF COMPUTER PROGRAMMING}
\ctrline{(Volume 2)}
\ctrline{(second half of the answers)}
\ctrline{$\copyright$ 1980 Addison--Wesley Publishing Company, Inc.}
\vfill
\ninepoint
\runningrighthead{ANSWERS TO EXERCISES}
\section{4.1}
\penalty-9999
\setcount0 562
%folio 723 galley 10b (C) Addison-Wesley 1978 *
\ansbegin{4.1}
\ansno 1. 1010, 1011, 1000, $\ldotss$,
11000, 11001, 11110.
\ansno 2. (a)\9 $-110001$, $-11.001001001001 \ldotss
$, $11.0010010000111111 \ldotss\,$.
(b)\9 $11010011$, $1101.001011001011 \ldotss$ , $111.011001000100000
\ldotss\,$.
\def\≡{\overline1}
(c)\9 $\≡11\≡\≡$, $\≡0.0\≡\≡0110\≡\≡011 \ldotss
$, $10.011\≡111\≡ \ldotss\,$.
(d)\9 $-9.4$, $- \ldotsm 7582417582413$, $\ldotss 562951413$.
\ansno 3. $(1010113.2)↓{2i}$.
\ansno 4. (a)\9 Between rA and rX.\xskip (b) The remainder
in rX has radix point between bytes 3 and 4; the quotient in
rA has radix point one byte to the right of the least significant
portion of the register.
\ansno 5. It has been subtracted from $999 \ldotsm
9 = 10↑p - 1$, instead of from $1000 \ldotsm 0 = 10↑p$.
\ansno 6. (a,c)\9 $2↑{p-1} - 1$, $-(2↑{p-1} - 1)$;\xskip (b) $2↑{p-1}
- 1$, $-2↑{p-1}$.
\ansno 7. A ten's complement representation for
a negative number $x$ can be obtained by considering $10↑n +
x$ (where $n$ is large enough for this to be positive) and extending
it on the left with infinitely many nines. The nines' complement
representation can be obtained in the usual manner.\xskip (These two
representations are equal for nonterminating decimals, otherwise
the nines' complement representation has the form $\ldotss (a)99999
\ldots$ while the ten's complement representation has the form
$\ldotss (a + 1)0000 \ldotss\,$.) The representations may be
considered sensible if we regard the value of the infinite sum
$N = 9 + 90 + 900 + 9000 + \cdots$ as $-1$, since $N
- 10N = 9$.
See also exercise↔31, which considers $p$-adic
number systems. The latter agree with the $p$'s complement notations
considered here, for numbers whose radix-$p$ representation
is terminating, but there is no simple relation between the
field of $p$-adic numbers and the field of real numbers.
\ansno 8. $\sum ↓j a↓jb↑j = \sum ↓j (a↓{kj+k-1}b↑{k-1}
+ \cdots + a↓{kj})b↑{kj}$.
\ansno 9. {\tt A BAD AD0BE FACADE FADED}.\xskip [{\sl Note:}
Other possible ``number sentences'' would be \.{D0} \.{A} \.{DEED} \.{A}
\.{DECADE}; \.{A} \.{CAD} \.{FED} \.{A} \.{BABE} \.{BEEF},
\.{C0C0A}, \.{C0FFEE}; \.{B0B} \.{FACED} \.{A} \.{DEAD} \.{D0D0}.]
%folio 725 galley 1 (C) Addison-Wesley 1978 *
\ansno 10. $\dispstyle
\left[\vcenter{\halign{\rt{$#$}⊗\rt{$\,#$}⊗\rt{$\,#$}⊗\rt{$\,
#$}⊗\rt{$\,#$}⊗\rt{$\;#$}⊗\rt{$\;#\ldotsm$}\cr
\ldotsm,⊗a↓3,⊗a↓2,⊗a↓1,⊗a↓0;⊗a↓{-1},⊗a↓{-2},\cr
\ldotsm,⊗b↓3,⊗b↓2,⊗b↓1,⊗b↓0;⊗b↓{-1},⊗b↓{-2},\cr}}\right]=
\left[\vcenter{\halign{\rt{$#$}⊗\rt{$\,#$}⊗\rt{$\,#$}⊗\rt{$\,
#$}⊗\rt{$\,#$}⊗\rt{$\;#$}⊗\rt{$\;#\ldotsm$}\cr
\ldotsm,⊗A↓3,⊗A↓2,⊗A↓1,⊗A↓0;⊗A↓{-1},⊗A↓{-2},\cr
\ldotsm,⊗B↓3,⊗B↓2,⊗B↓1,⊗B↓0;⊗B↓{-1},⊗B↓{-2},\cr}}\right]$, if
$$A↓j=\left[\vcenter{\halign{\rt{$# $}⊗\rt{$\,#,\ldotss,$}⊗\rt{$\,#$}\cr
a↓{k↓{j+1}-1},⊗a↓{k↓{j+1}-2}⊗a↓{k↓j}\cr
⊗b↓{k↓{j+1}-2}⊗b↓{k↓j}\cr}}\right],\qquad B↓j=b↓{k↓{j+1}-1}\ldotsm b↓{k↓j},$$
where $\langle k↓n\rangle$ is any infinite sequence
of integers with $k↓{j+1} > k↓j$.
\ansno 11. (The following algorithm works both for addition
or subtraction, depending on whether the plus or minus sign
is chosen.)
Start by setting $k ← a↓{n+1} ← a↓{n+2} ← b↓{n+1}
← b↓{n+2} ← 0$; then for $m = 0$, 1, $\ldotss$, $n + 2$ do the following:
Set $c↓m ← a↓m \pm b↓m + k$; then if $c↓m ≥ 2$, set $k ← -1$
and $c↓m ← c↓m - 2$; otherwise if $c↓m < 0$, set $k ← 1$ and $c↓m ← c↓m
+ 2$; otherwise (i.e., if $0 ≤ c↓m ≤ 1$), set $k ← 0$.
\ansno 12. (a)\9 Subtract $\pm(\ldotsm a↓30a↓10)↓{-2}$ from $\pm
(\ldotsm a↓40a↓20a↓0)↓{-2}$
in the negabinary system.\xskip (See also exercise 7.1--18 for a trickier
solution.)\xskip (b) Subtract $(\ldotsm b↓30b↓10)↓2$
from $(\ldotsm b↓40b↓20b↓0)↓2$ in the binary system.
\ansno 13. $(1.909090\ldotsm)↓{-10} = (0.090909\ldotsm)↓{-10} = {1\over 11}$.
\ansno 14. \hbox to 329pt{$\def\\{\hskip 4.5pt}\hskip0pt plus 200pt
\eqalign{1\\1\\3\\2\\1⊗\qquad[5-4i]\cr
1\\1\\3\\2\\1⊗\qquad[5-4i]\cr
\noalign{\vskip -7.6pt}
\vbox{\hrule width 40.5pt}\cr
\noalign{\vskip-1.75pt}
1\\1\\3\\2\\1\cr
1\\1\\2\\0\\2\\\\\cr
1\\2\\1\\2\\3\\\\\\\\\cr
1\\1\\3\\2\\1\\\\\\\\\\\\\cr
1\\1\\3\\2\\1\\\\\\\\\\\\\\\\\cr
\noalign{\vskip 3pt\hrule width 76.5pt\vskip 3pt}
0\\1\\0\\3\\1\\1\\2\\0\\1⊗\qquad[9-40i]\cr}\hskip0pt plus 200pt$}
\ansno 15. $[-{10\over 11}, {1\over 11}]$, and the rectangle
shown in Fig.\ A--6.
\topinsert{\vskip 40mm
\moveright 190pt\hbox par 128pt{\caption Fig.\ A--6.
Fundamental region for quater-imaginary numbers.}}
\ansno 16. It is tempting to try to do this in a very simple
way, by using the rule $2 = (1100)↓{i-1}$ to take care of carries;
but that leads to a nonterminating method if, for example, we try to add 1
to $(11101)↓{i-1} = -1$.
The following solution does the job by providing
four related algorithms (namely for adding or subtracting 1
or $i$). If $α$ is a string of zeros and ones, let $α↑P$ be
a string of zeros and ones such that $(α↑P)↓{i-1} = (α)↓{i-1}
+ 1$; and let $α↑{-P}$, $α↑Q$, $α↑{-Q}$ be defined similarly, with
$-1$, $+i$, and $-i$ respectively in place of $+1$. Then
$$\eqalign{(α0)↑P⊗=α1;\cr(αx0)↑{-P}⊗=α↑{-Q}x1;\cr}\quad
\eqalign{(αx1)↑P⊗=α↑Qx0.\cr(α1)↑{-P}⊗=α0.\cr}\qquad\qquad
\eqalign{(α0)↑Q⊗=α↑P1;\cr(α0)↑{-Q}⊗=α↑Q1;\cr}\quad
\eqalign{(α1)↑Q⊗=α↑{-Q}0.\cr(α1)↑{-Q}⊗=α↑{-P}0.\cr}$$
Here $x$ stands for either 0 or 1, and the strings are
extended on the left with zeros if necessary. The processes
will clearly always terminate. Hence every number of the form
$a + bi$ with $a$ and $b$ integers is representable in the $i - 1$
system.
\ansno 17. No (in spite of exercise↔28); the number $-1$ cannot be so represented.
This can be proved by constructing a set $S$ as in Fig.↔1. We do have
the representations $-i = (0.1111\ldotsm)↓{1+i}$, $i = (100.1111\ldotsm
)↓{1+i}$.
\lineskip 0pt
\ansno 18. Let $S↓0$ be the set of points $(a↓7a↓6a↓5a↓4a↓3a↓2a↓1a↓0)↓{i-1}$,
where each $a↓k$ is 0 or 1.\xskip (Thus, $S↓0$ is given by the 256 interior
dots shown in Fig.↔1, if that picture is multiplied by 16.)\xskip
We first show that $S$ is closed: If $y↓1$, $y↓2$, $\ldots$ is an
infinite subset of $S$, we have $y↓n = \sum ↓{k≥1\lower 1pt\null} a↓{nk}16↑{-k}$,
where each $a↓{nk}$ is in $S↓0$. Construct a tree whose nodes
are $(a↓{n1}, \ldotss , a↓{nr})$, for $1 ≤ r ≤ n$, and let a
node of this tree be an ancestor of another node if it is an
initial subsequence of that node. By the infinity lemma this
tree has an infinite path $(a↓1, a↓2, a↓3, \ldotss)$, and it follows that $\sum
↓{k≥1} a↓k16↑{-k}$ is a limit point of $\{y↓1, y↓2, \ldotss\}$
in $S$.
\lineskip 1pt
By the answer to exercise↔16, all numbers of the
form $(a + bi)/16↑k$ are representable, when $a$ and $b$ are
integers. Therefore if $x$ and $y$ are arbitrary reals and $k
≥ 1$, the number $z↓k = (\lfloor 16↑kx\rfloor + \lfloor 16↑ky\rfloor
i)/16↑k$ is in $S + m + ni$ for some integers $m$ and $n$. It
can be shown that $S + m + ni$ is bounded away from the origin
when $(m, n) ≠ (0, 0)$. Consequently if $|x|$ and $|y|$ are
fixed and $k$ is sufficiently large, we have $z↓k
\in S$, and $\lim↓{k→∞}z↓k = x + yi$ is in $S$.
\ansno 19. If $m > u$ or $m < l$, find $a \in D$ such that $m
≡ a \modulo b$; the desired representation will be a representation
of $m↑\prime = (m - a)/b$ followed by $a$. Note that $m > u$
implies $l < m↑\prime < m$; $m < l$ implies $m < m↑\prime < u$;
so the algorithm terminates.
[There are no solutions when $b = 2$. The representation
will be unique iff $0 \in D$; nonunique representation occurs
for example when $D = \{-3, -1, 7\}$, $b = 3$, since $(α)↓3 =
(\overline377\overline3α)↓3$. When $b ≥ 3$ it is not difficult
to show that there are exactly $2↑{b-3}$ solution sets $D$ in
which $|a| < b$ for all $a \in D$. Furthermore the set $D =
\{0$, 1, $2 - ε↓2b↑n$, $3 - ε↓3b↑n$, $\ldotss$, $b - 2 - ε↓{b-2}b↑n$, $b
- 1 - b↑n\}$ gives unique representations, for all $b ≥ 3$ and
$n ≥ 1$, when each $ε↓j$ is 0 or 1. Reference: {\sl Proc.\ IEEE Symp.\ Comp.\
Arith.\ \bf4} (1978), 1--9.]
\ansno 20. (a)\9 $0.\overline1\overline1\overline1\,\ldots=\overline1.888\,
\ldots=\overline18.{111\atop777}\,\ldots=\overline18{1\atop7}.{222\atop666}\,
\ldots=\cdots=\overline18{123456\atop765432}.{777\atop111}\,\ldots$ has
nine representations.\xskip (b) A ``$D$-fraction'' $.a↓1a↓2\,\ldots$
always lies between $-1/9$ and $+71/9$. Suppose $x$ has ten or more
$D$-decimal representations. Then for sufficiently large $k$,
$10↑kx$ has ten representations that differ to the left of the
decimal point: $10↑kx = n↓1 + f↓1 = \cdots = n↓{10} + f↓{10}$
where each $f↓j$ is a $D$-fraction. By uniqueness of integer representations,
the $n↓j$ are distinct, say $n↓1 < \cdots < n↓{10}$, hence $n↓{10}
- n↓1 ≥ 9$; but this implies $f↓1 - f↓{10} ≥ 9 > 71/9 - (-1/9)$,
a contradiction.\xskip (c) Any number of the form $0.a↓1a↓2 \ldotsm
$, where each $a↓j$ is $-1$ or 8, equals $\overline1.a↓1↑\prime a↓2↑\prime
\,\ldots$ where $a↑\prime↓{j} = a↓j + 9$ (and it even has six
{\sl more} representations $\overline18.a↑{\prime\prime}↓1a↑{\prime\prime}↓2
\ldotsm$, etc.).
\ansno 21. We can convert to such a representation by using
a method like that suggested in the test for converting to balanced
ternary.
In contrast to the systems of exercise↔20, zero
can be represented in infinitely many ways, all obtained from
${1\over 2} + \sum ↓{k≥1}(-4{1\over 2}) \cdot 10↑{-k}$ (or
from the negative of this representation) by multiplying it
by a power of ten. The representations of unity are $1{1\over
2} - {1\over 2}$, ${1\over 2} + {1\over 2}$, $5 - 3{1\over
2} - {1\over 2}$, $5 - 4{1\over 2} + {1\over 2}$, $50 - 45
- 3{1\over 2} - {1\over 2}$, $50 - 45 - 4{1\over 2} + {1\over
2}$, etc., where $\pm {1\over 2} = (\pm 4{1\over 2})(10↑{-1}
+ 10↑{-2} + \cdotss)$.\xskip [{\sl AMM \bf 57} (1950), 90--93.]
\ansno 22. Given some approximation $b↓n \ldotsm
b↓1b↓0$ with error $\sum ↓{0≤k≤n}b↓k10↑k
- x > 10↑{-t}$ for $t > 0$, we will show how to reduce the error
by approximately $10↑{-t}$.\xskip (The process can be started by finding
a suitable $\sum ↓{0≤k≤n} b↓k10↑k > x$; then a finite
number of reductions of this type will make the error less than
$ε$.)\xskip Simply choose $m > n$ so large that the decimal representation
of $-10↑mα$ has a one in position $10↑{-t}$ and no ones in positions
$10↑{-t+1}$, $10↑{-t+2}$, $\ldotss$, $10↑n$. Then $10↑mα + ($a suitable
sum of powers of 10 between $10↑m$ and $10↑n) + \sum ↓{0≤k≤n}
b↓k10↑k \approx \sum ↓{0≤k≤n}b↓k10↑k - 10↑{-t}$.
\ansno 23. The set $S = \leftset\sum ↓{k≥1} a↓kb↑{-k}\relv a↓k \in
D\rightset$ is closed as in exercise↔18, hence measurable, and in fact
it has positive measure. Since $bS = \union↓{a\in D}(a + S)$, we
have $b\mu (S) = \mu (bS) ≤ \sum ↓{a\in D}\mu (a + S) = \sum
↓{a\in D}\mu (S) = b\mu (S)$, and we must have $\mu \biglp
(a + S) ∩ (a↑\prime + S)\bigrp = 0$ when $a ≠ a↑\prime \in D$.
Now $T$ has measure zero
since it is a union of countably many sets of the form $10↑k\biglp
n + ((a + S) ∩ (a↑\prime + S))\bigrp$, $a ≠ a↑\prime $, each
of measure zero.
[The set $T$ cannot be empty, since the real numbers
cannot be written as a countable union of disjoint, closed,
bounded sets; cf.\ {\sl AMM \bf 84} (1977), 827--828.
If $D$ has less than $b$ elements, the set of
numbers representable with radix $b$ and digits from $D$ has
measure zero. If $D$ has more than $b$ elements and represents all
reals, $T$ has infinite measure.]
\ansno 24. $\leftset 2a\cdot10↑k+a↑\prime\relv0≤a<5,0≤a↑\prime<2\rightset$
or $\leftset 5a↑\prime\cdot10↑k+a\relv0≤a<5,0≤a↑\prime<2\rightset$,
for $k ≥ 0$.\xskip [R. L. Graham has shown
that there are no more sets of integer digits with these properties.
And Andrew Odlyzko has shown that the restriction to integers
is superflous, in the sense that if the smallest two elements
of $D$ are 0 and 1, all the digits must be integers.\xskip {\sl Proof:}
Let $S=\leftset\sum↓{k<0}a↓kb↑k\relv a↓k\in D\rightset$
and $X = \leftset(a↓n \ldotsm a↓0)↓b \relv a↓k \in D\rightset$; then $[0, ∞) =
\union↓{x\in X}(x + S)$, and $(x + S) ∩ (x↑\prime + S)$ has measure
zero for $x ≠ x↑\prime \in X$. We have $(0, 1) \subset S$, and
by induction on $m$ we will prove that $(m, m + 1) \subset x↓m
+ S$ for some $x↓m \in X$. Let $x↓m\in X$ be such that $(m, m
+ ε)∩ (x↓m + S)$ has positive measure for all $ε > 0$. Then $x↓m
≤ m$, and $x↓m$ must be an integer lest $x↓{\lfloor x↓m\rfloor}
+ S$ overlap $x↓m + S$ too much. If $x↓m > 0$, the fact that
$(m - x↓m, m - x↓m + 1) ∩ S$ has positive measure implies by
induction that this measure is 1, and $(m, m + 1) \subset x↓m
+ S$ since $S$ is closed. If $x↓m = 0$ and $(m, m + 1) \not
\subset S$, we must have $m < x↑\prime↓{m}
< m + 1$ for some $x↑\prime↓{m} \in X$, where $(m, x↑\prime↓{m})
\subset S$; but then $1 + S$ overlaps $x↑\prime↓{m} + S$. Reference:
{\sl Proc.\ London Math.\ Soc.\ \bf18} (1978), to appear.]
[If we drop the restriction $0 \in D$, there {\sl
are} many other cases, some of which are quite interesting,
especially the sets $\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\{1,
2, 3, 4, 5, 51, 52, 53, 54, 55\}$, and $\{2, 3, 4, 5, 6, 52, 53,
54, 55, 56\}$. Alternatively if we allow negative digits we obtain
many other solutions by the method of exercise↔19, plus further
sets like $\{-1$, 0, 1, 2, 3, 4, 5, 6, 7, $18\}$ which don't meet
the conditions of that exercise; it appears hopeless to find
a nice characterization of all solutions with negative digits.]
\ansno 25. A positive number whose base $b$ representation has
$m$ consecutive $(b - 1)$'s to the right of the decimal point
must have the form $c/b↑n + (b↑m - \theta )/b↑{n+m}$, where
$c$ and $n$ are nonnegative integers and $0 < \theta ≤1$. So if $u/v$
has this form, we find that $b↑{m+n}u = b↑mcv + b↑mv - \theta
v$. Therefore $\theta v$ is an integer that is a multiple of
$b↑m$. But $0 < \theta v ≤ v < b↑m$.\xskip $\biglp$There can be arbitrarily
long runs of other digits $aaaaa$, if $0 ≤ a < b - 1$,
for example in the representation of $a/(b - 1).\bigrp$
%folio 731 galley 2a (C) Addison-Wesley 1978 *
\ansno 26. The proof of ``sufficiency'' is a straightforward
generalization of the usual proof for base $b$, by successively
constructing the desired representation. The proof of ``necessity''
breaks into two parts: If $β↓{n+1}$ is greater than $\sum ↓{k≤n}
c↓kβ↓k$ for some $n$, then $β↓{n+1} - ε$ has no representation
for small $ε$. If $β↓{n+1} ≤ \sum ↓{k≤n} c↓kβ↓k$ for all
$n$, but equality does not always hold, we can show that there are
two representations for certain $x$.\xskip [See {\sl Transactions
of the Royal Society of Canada}, series III, {\bf 46} (1952),
45--55.]
\ansno 27. Proof by induction on $|n|$: If $n$ is even we must
take $e↓0 > 0$, and the result follows by induction, since
$n/2$ has a unique such representation. If $n$ is odd, we must
take $e↓0 = 0$, and the problem reduces to representing $-(n
- 1)/2$; if the latter quantity is either zero or one, there
is obviously only one way to proceed, otherwise it has a unique reversing
representation by induction.
[It follows that every positive integer has exactly {\sl two} such
representations with {\sl decreasing} exponents $e↓0>e↓1>\cdots>e↓t$:
one with $t$ even and the other with $t$ odd.]
\ansno 28. A proof like that of exercise↔27 may be given. Note
that $a + bi$ is a multiple of $1 + i$ by a complex integer
if and only if $a + b$ is even.
\ansno 29. It suffices to prove that any collection $\{T↓0, T↓1,
T↓2, \ldotss\}$ satisfying Property B may be obtained by collapsing
some collection $\{S↓0, S↓1, S↓2, \ldotss\}$, where $S↓0 = \{0, 1,
\ldotss , b - 1\}$ and all elements of $S↓1$, $S↓2$, $\ldots$ are
multiples of $b$.
To prove the latter statement, we may assume that
$1 \in T↓0$ and that there is a least element $b > 1$ such
that $b \notin T↓0$. We will prove, by induction on $n$,
that if $nb \not\in T↓0$, then $nb + 1$, $nb + 2$, $\ldotss$,
$nb + b - 1$ are not in any of the $T↓j$'s; but if $nb \in
T↓0$, then so are $nb + 1$, $\ldotss$, $nb + b - 1$. The result then
follows with $S↓1 = \leftset nb \relv nb \in T↓0\rightset$, $S↓2 = T↓1$, $S↓3 =
T↓2$, etc.
If $nb \notin T↓0$, then $nb = t↓0 + t↓1+
\cdotss$, where $t↓1$, $t↓2$, $\ldots$ are multiples of $b$; hence
$t↓0 < nb$ is a multiple of $b$. By induction, $(t↓0 + k) +
t↓1 + t↓2 + \cdots$ is the representation of $nb
+ k$, for $0 < k < b$; hence $nb + k \notin T↓j$ for any
$j$.
If $nb \in T↓0$ and $0 < k < b$, let the representation
of $nb + k$ be $t↓0 + t↓1 + \cdotss$. We cannot have $t↓j =
nb + k$ for $j ≥ 1$, lest $nb + b$ have two representations
$(b - k) + \cdots + (nb + k) + \cdots = (nb) + \cdots + b +
\cdotss$. By induction, $t↓0 \mod b = k$; and the representation
$nb = (t↓0 - k) + t↓1 + \cdots$ implies that $t↓0
= nb + k$.
[Reference: {\sl Nieuw Archief voor Wiskunde} (3)
{\bf 4} (1956), 15--17. A finite analog of this result was derived
by P. A. MacMahon, {\sl Combinatory Analysis \bf 1} (1915),
217--223.]
\ansno 30. (a)\9 Let $A↓j$ be the set of numbers $n$ whose representation
does not involve $b↓j$; then by the uniqueness property, $n
\in A↓j$ iff $n + b↓j \notin A↓j$. Consequently $n \in
A↓j$ iff $n + 2b↓j \in A↓j$. It follows that, for $j ≠ k$, $n \in
A↓j ∩ A↓k$ iff $n + 2b↓jb↓k \in A↓j ∩ A↓k$. Let $m$ be the number
of integers $n \in A↓j ∩ A↓k$ such that $0 ≤ n < 2b↓jb↓k$. Then
this interval contains exactly $m$ integers that are in
$A↓j$ but not $A↓k$, exactly $m$ in $A↓k$ but not $A↓j$, and
exactly $m$ in neither $A↓j$ nor $A↓k$; hence $4m = 2b↓jb↓k$.
Therefore $b↓j$ and $b↓k$ cannot both be odd. But at least one
$b↓j$ is odd, of course, since odd numbers can be represented.\xskip
(b) According to (a) we can renumber the $b$'s so that $b↓0$ is odd and
$b↓1$, $b↓2$, $\ldots$ are even; then ${1\over2}b↓1$, ${1\over2}b↓2$,
$\ldots$ must also be a binary basis, and the process can be iterated.
(c)\9 If it is a binary basis, we must have positive and negative $d↓k$'s for
arbitrarily large $k$, in order to represent $\pm2↑n$ when $n$ is large.
Conversely, the following algorithm may be used:
\algstep S1. [Initialize.] Set $k←0$.
\algstep S2. [Done?] If $n=0$, terminate.
\algstep S3. [Choose.] If $n$ is odd, include $2↑kd↓k$ in the representation,
and replace $n$ by $(n-d↓k)/2$. Otherwise set $n←n/2$.
\algstep S4. [Advance $k$.] Increase $k$ by 1 and return to S2.\quad\blackslug
\yskip At each step the choice is forced; furthermore
step S3 decreases $|n|$ unless $n=-d↓k$, hence the algorithm must terminate.
(d)\9 Two iterations of steps S2--S4 in the preceding algorithm will transform
$4m→m$, $4m+1→m+5$, $4m+2→m+7$, $4m+3→m-1$. Arguing as in exercise↔19, we
need only show that the algorithm terminates for $-2≤n≤8$; all other values of
$n$ are moved toward this interval. In this
range $3 → -1 → -2 → 6 → 8 → 2 → 7 → 0$ and $4 → 1 → 5 → 6$.
Thus $1 = 7 \cdot 2↑0 - 13 \cdot 2↑1 + 7 \cdot 2↑2 - 13 \cdot
2↑3 - 13 \cdot 2↑5 - 13 \cdot 2↑9 + 7 \cdot 2↑{10}$.
{\sl Note:} The choice $d↓0$, $d↓1$, $d↓2$, $\ldots
= 5$, $-3$, 3, 5, $-3$, 3, $\ldots$ also yields a binary basis. For
further details see {\sl Math.\ Comp.\ \bf 18} (1964), 537--546;
A. D. Sands, {\sl Acta Mathematica}, Acad.\ Sci.\ Hung., {\bf 8}
(1957), 65--86.
\ansno 31. (See also the related exercises 3.2.2--11, 4.3.2--13,
4.6.2--22.)
(a)\9 By multiplying numerator and denominator by suitable powers
of↔2, we may assume that $u = (\ldotsm u↓2u↓1u↓0)↓2$ and $v =
(\ldotsm v↓2v↓1v↓0)↓2$ are 2-adic integers,
where $v↓0 = 1$. The following computational
method now determines $w$, using the notation $u↑{(n)}$ to stand
for the integer $(u↓{n-1} \ldotsm u↓0)↓2 = u \mod 2↑n$ when
$n > 0$:
Let $w↓0 = u↓0$. For $n = 1$, 2, $\ldotss$, assume
that we have found an integer $w↑{(n)} = (w↓{n-1} \ldotsm w↓0)↓2$
such that $u↑{(n)} ≡ v↑{(n)}w↑{(n)} \modulo{2↑n}$.
Then we have $u↑{(n+1)} ≡ v↑{(n+1)}w↑{(n)}
\modulo {2↑n}$, hence $w↓n = 0$ or 1 according as
$(u↑{(n+1)} - v↑{(n+1)}w↑{(n)})\mod 2↑{n+1}$ is 0 or $2↑n$.
(b)\9 Find the smallest integer $k$ such that $2↑k
≡ 1 \modulo {2n + 1}$. Then we have $1/(2n + 1) = m/(2↑k - 1)$ for
some integer $m$, $1 ≤ m < 2↑{k-1}$. Let $α$ be the $k$-bit binary
representation of $m$; then $(0.ααα\ldotsm)↓2$ times $2n + 1$
is $(0.111\ldotsm)↓2 = 1$ in the binary system, and $(\ldotsm ααα)↓2$
times $2n + 1$ is $(\ldotsm 111)↓2 = -1$ in the 2-adic system.
(c)\9 If $u$ is rational, say $u = m/2↑en$ where
$n$ is odd and positive, the 2-adic representation of $u$ is
periodic, because the set of numbers with periodic expansions
includes $-1/n$ and is closed under the operations of negation,
division by 2, and addition. Conversely, if $u↓{N+λ}=u↓N$ for all
$N≥\mu$, the 2-adic number $(2↑λ-1)2↑{-\mu}u$ is an integer.
(d)\9 The square of any number of the form $(\ldotsm
u↓2u↓11)↓2$ has the form $(\ldotsm 001)↓2$, hence the condition
is necessary. To show the sufficiency, we can use the following procedure
to compute $v = \sqrt{n}$ when $n \mod 8 = 1$:
\algstep H1. [Initialize.] Set $m ← (n - 1)/8$, $k ← 2$, $v↓0 ←
1$, $v↓1 ← 0$, $v ← 1$.\xskip (During this algorithm we will have $v=(v↓{k-1}\ldotsm
v↓1v↓0)↓2$ and $v↑2=n-2↑{k+1}m$.)
\algstep H2. [Transform.] If $m$ is even, set $v↓k ← 0$,
$m ← m/2$. Otherwise set $v↓k ← 1$, $m ← (m - v - 2↑{k-1})/2$, $v
← v + 2↑k$.
\algstep H3. [Advance $k$.] Increase $k$ by 1 and return to H2.\quad\blackslug
\ansno 32. A generalization appears
in {\sl Math.\ Comp.\ \bf 29} (1975), 84--86.
\ansno 33. Let $K↓n$ be the set of all such $n$-digit numbers, so that
$k↓n=\|K↓n\|$. If $S$ and $T$ are any finite sets of integers, we shall say
$S~T$ if $S=T+x$ for some integer $x$, and we shall write $k↓n(S)=\|\Kscr↓n(S)\|$,
where $\Kscr(S)$ is the family of all subsets of $K↓n$ that are $~S$. When
$n=0$, we have $k↓n(S)=0$ unless $\|S\|≤1$, since zero is the only ``0-digit''
number. When $n≥1$ and $S=\{s↓1,\ldotss,s↓r\}$, we have $$\Kscr(S)=
\union↓{0≤j<b}\union↓{(a↓1,\ldotss,a↓r)}\mathopen{\vcenter{\hbox{\:@\char'10}}}\,
\{t↓1b+a↓1,\ldotss,t↓rb+a↓r}\;\left|\;\{t↓1,\ldotss,t↓r\}\in K↓{n-1}\biglp
\leftset(S↓i+j-a↓i)/b\relv1≤i≤r\rightset\bigrp\,\mathclose{\vcenter{
\hbox{\:@\char'11}}},$$
where the inner union is over all sequences of digits $(a↓1,\ldotss,a↓r)$ such
that $a↓i≡S↓i+j\modulo b$ for $1≤i≤r$. In this formula we require $t↓i-t↓{i↑\prime}
=(s↓i-a↓i)/b-(s↓{i↑\prime}-a↓{i↑\prime})/b$ for $1≤i<i↑\prime≤r$, so that
the naming of subscripts is uniquely determined. By the principle of
\α{inclusion and exclusion}, therefore, we have $k↓n(S)=\sum↓{0≤j<b}\sum↓{m≥1}
(-1)↑{m-1}f(S,m,j)$, where $f(S,m,j)$ is the number of sets of integers that can
be expressed as $\{t↓1b+a↓1,\ldotss,t↓rb+a↓r\}$ in the above manner for $m$
different sequences $(a↓1,\ldotss,a↓r)$, summed over all choices of $m$ different
sequences $(a↓1,\ldotss,a↓r)$. Given $m$ different sequences $(a↓1↑{(l)},\ldotss,
a↓r↑{(l)})$ for $1≤l≤m$, the number of such sets is $k↓{n-1}\biglp\{(s↓i+j
-a↓i↑{(l)})/b\relv 1≤i≤r,1≤l≤m\}\bigrp$. Thus there is a collection of sets
$\Gscr(S)$ such that $$k↓n(S)=\sum↓{G\in\Gscr(S)}c↓G\,k↓{n-1}(G),$$
where each $c↓G$ is an integer. Furthermore if $G\in \Gscr(S)$ we have
$\min G≥(\min S-\max D)/b$ and $\max G≤(\max S+b-1-\min D)/b$. Thus we have
simultaneous recurrence relations for the sequences $\langle k↓n(S)\rangle$,
where $S$ is any nonempty integer subset of $[l,u+1]$ in the notation of
exercise↔19. Since $k↓n=k↓n(S)$ for any one-element set $S$, the sequence
$\langle k↓n\rangle$ appears these recurrences. The coefficients $c↓G$ can
be computed from the first few values of $k↓n(S)$, so we can obtain a system of
equations defining the generating functions $k↓S(z)=\sum k↓n(S)z↑n=
\biglp\|S\|≤1\bigrp+z\sum↓{G\in\Gscr(S)}k↓G(z)$.
For example, when $D=\{-1,0,3\}$ and $b=3$ we have $l=-{3\over2}$ and $u={1\over2}$,
so the relevant sets $S$ are $\{0\}$, $\{0,1\}$, $\{-1,1\}$, and
$\{-1,0,1\}$. The corresponding sequences for $n≤3$ are $\langle1,3,8,21\rangle$,
$\langle0,1,3,8\rangle$, $\langle0,0,1,4\rangle$, and $\langle0,0,0,0\rangle$;
so we obtain
$$\eqalign{k↓0(z)⊗=1+z\biglp3k↓0(z)-k↓{01}(z)\bigrp,\cr
k↓{01}(z)⊗=zk↓0(z),\cr}\qquad\qquad
\eqalign{k↓{02}(z)⊗=z\biglp k↓{01}(z)+k↓{02}(z)\bigrp,\cr
l↓{012}(z)⊗=0,\cr}$$
and $k(z)=1/(1-3z+z↑2)$.
\inx{Fibonacci number}
In this case $k↓n=F↓{2n+2}$ and $k↓n\biglp\{0,2\}\bigrp=F↓{2n-1}-1$.
%folio 733 galley 2b (C) Addison-Wesley 1978 *
\ansbegin{4.2.1}
\def\\{\spose{\raise 3.375pt\hbox{\sl\hskip.05em\char'31}}} % \\h will be h bar
\ansno 1. $N = (62, +.60\ 22\ 52\ 00)$;
$\\h = (37, +.10\ 54\ 50\ 00)$. Note that $10\\h$ would be $(38, +.01\
05\ 45\ 00)$.
\ansno 2. $b↑{E-q}(1 - b↑{-p})$, $b↑{-q-p}$;\xskip $b↑{E-q}(1 - b↑{-p})$,
$b↑{-q-1}$.
\ansno 3. When $e$ does not have its smallest value,
the most significant ``one'' bit (which appears in all such
normalized numbers) need not appear in the computer word.
\ansno 4. $(51, +.10209877)$; $(50, +.12346000)$; $(53, +.99999999)$.
The third answer would be $(54, +.10000000)$ if the first operand
had been $(45, -.50000000)$.
\ansno 5. If $x ~ y$ and $m$ is an integer then $mb + x ~ mb
+ y$. Furthermore $x~y$ implies $x/b~y/b$, by considering all
possible cases. Another crucial property is that $x$ and $y$ will
round to the same integer, whenever $x ~ y$.
Now if $b↑{-p-2}F↓v ≠ f↓v$ we must have $(b↑{p+2}f↓v)\mod
b ≠ 0$; hence the transformation leaves $f↓v$ unchanged unless
$e↓u - e↓v ≥ 2$. Since $u$ was normalized, it is nonzero and
$|f↓u + f↓v| > b↑{-1} - b↑{-2} ≥ b↑{-2}$: the leading nonzero
digit of $f↓u + f↓v$ must be at most two places to the right
of the radix point, and the rounding operation will convert
$b↑{p+j}(f↓u + f↓v)$ to an integer, where $j ≤ 1$. The proof
will be complete if we can show that $b↑{p+j+1}(f↓u + f↓v)
~ b↑{p+j+1}(f↓u + b↑{-p-2}F↓v)$. By the previous paragraph,
we have $b↑{p+2}(f↓u + f↓v) ~ b↑{p+2}f↓u + F↓v = b↑{p+2}(f↓u
+ b↑{-p-2}F↓v)$, which implies the desired result for all $j
≤ 1$.
Note that, when $b > 2$ is even, such an integer
$F↓v$ always exists; but when $b = 2$ we require $p + 3$ bits
(let $2F↓v$ be an integer). When $b$ is odd, an integer $F↓v$
always exists except in the case of division, when a remainder
of ${1\over 2}b$ is possible.
\ansno 6. (Consider the case $e↓u = e↓v$, $f↓u = -f↓v$ in
Program↔A\null.)\xskip Register↔A retains its previous sign, as in \.{ADD}.
\ansno 7. Say that a number is normalized iff it is zero or
its fraction part satisfies ${1\over 6} < |f| < {1\over 2}$.
A $(p + 1)$-place accumulator suffices for addition and subtraction;
rounding (except during division) is equivalent to truncation.
A very pleasant system indeed! We might represent numbers with
excess-zero exponent, inserted between the first and subsequent
digits of the fraction, and complemented if the fraction is
negative, so that fixed-point order is preserved.
\ansno 8. (a)\9 $(06, +.12345679) \oplus (06, -.12345678)$,\xskip $(01,
+.10345678) \oplus (00, -.94000000)$;\xskip\penalty1000 (b) $(99, +.87654321) \oplus
\hbox{itself}$,\xskip $(99, +.99999999) \oplus (91, +.50000000)$.
\ansno 9. $a = (-50, +.10000000)$, $b = (-41, +.20000000)$, $c =
a$, $d = (-41, +.80000000)$, $y = (11, +.10000000)$.
\ansno 10. $(50, +.99999000) \oplus (55, +.99999000)$.
%folio 735 galley 3a (C) Addison-Wesley 1978 *
\ansno 11. $(50, +.10000001) \otimes (50, +.99999990)$.
\ansno 12. If $0 < |f↓u| < |f↓v|$, then $|f↓u| ≤ |f↓v| - b↑{-p}$;
hence $1/b < |f↓u/f↓v| ≤ 1 - b↑{-p}/|f↓v| < 1 - b↑{-p}$. If
$0 < |f↓v| ≤ |f↓u|$, we have $1/b < |f↓u/f↓v|/b ≤ \biglp(1 - b↑{-p})/(1/b)\bigrp/b
= 1 - b↑{-p}$.
\ansno 13. See J. Michael Yohe, {\sl IEEE Transactions} {\bf C--22}
(1973), 577--586; cf.\ also exercise 4.2.2--24.
\mixans 14. {⊗FIX⊗STJ⊗9F⊗Float-to-fix subroutine:\cr
⊗⊗STA⊗TEMP\cr
⊗⊗LD1⊗TEMP(EXP)⊗$\rI1 ← e$.\cr
⊗⊗SLA⊗1⊗$\rA ← \pm\,f\,f\,f\,f\,0$.\cr
⊗⊗JAZ⊗9F⊗Is input zero?\cr
⊗⊗DEC1⊗1\cr
⊗⊗CMPA⊗=0=(1:1)⊗If leading byte is zero,\cr
⊗⊗JE⊗*-4⊗\quad shift left again.\cr
\\
⊗⊗ENN1⊗-Q-4,1\cr
⊗⊗J1N⊗FIXOVFLO⊗Is magnitude too large?\cr
\\
⊗⊗ENTX⊗0\cr
⊗⊗SRAX⊗0,1\cr
\\
⊗⊗CMPX⊗=1//2=\cr
⊗⊗JL⊗9F\cr
⊗⊗JG⊗*+2\cr
⊗⊗JAO⊗9F\cr
\\
⊗⊗STA⊗*+1(0:0)⊗Round, if necessary.\cr
⊗⊗INCA⊗1⊗Add $\pm 1$ (overflow is impossible).\cr
⊗9H⊗JMP⊗*⊗Exit from subroutine.\quad\blackslug\cr}
\mixans 15. {⊗FP⊗STJ⊗EXITF⊗Fractional part subroutine:\cr
⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
⊗⊗STA⊗TEMP⊗$\.{TEMP} ← u$.\cr
⊗⊗ENTX⊗0\cr
⊗⊗SLA⊗1⊗$\rA ← f↓u$.\cr
\\
⊗⊗LD2⊗TEMP(EXP)⊗$\rI2 ← e↓u$.\cr
⊗⊗DEC2⊗Q\cr
⊗⊗J2NP⊗*+3\cr
⊗⊗SLA⊗0,2⊗Remove integer part of $u$.\cr
⊗⊗ENT2⊗0\cr
⊗⊗JANN⊗1F\cr
\\
⊗⊗ENN2⊗0,2⊗Fraction is negative: find\cr
⊗⊗SRAX⊗0,2⊗\quad its complement.\cr
⊗⊗ENT2⊗0\cr
⊗⊗JAZ⊗*+2\cr
⊗⊗INCA⊗1\cr
⊗⊗ADD⊗WM1⊗Add word size minus one.\cr
\\
⊗1H⊗INC2⊗Q⊗Prepare to normalize the answer.\cr
⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\cr
⊗8H⊗EQU⊗1(1:1)\cr
⊗WM1⊗CON⊗8B-1,8B-1(1:4)⊗Word size minus one\quad\blackslug\cr}
\ansno 16. If $|c| ≥ |d|$, then set $r ← d \odiv c$,
$s ← c \oplus (r \otimes d)$; $x ← \biglp a \oplus (b \otimes r)\bigrp
\odiv s$, $y ← \biglp b \ominus (a \otimes r)\bigrp \odiv
s$. Otherwise set $r ← c \odiv d$, $s ← d \oplus (r \otimes
c)$; $x ← \biglp (a \otimes r) \oplus b\bigrp \odiv s$, $y ← \biglp
(b \otimes r) \ominus a\bigrp \odiv s$. Then $x + iy$ is the
desired approximation to $(a + bi)/(c + di)$.\xskip [{\sl CACM \bf 5} (1963),
435. Other algorithms for complex arithmetic and function evaluation
are given by P. Wynn, {\sl BIT \bf 2} (1962), 232--255; see
also Paul Friedland, {\sl CACM \bf 10} (1967), 665.]
\ansno 17. See Robert Morris, {\sl IEEE Transactions} {\bf C--20}
(1971), 1578--1579. Error analysis is more difficult with such
systems, so interval arithmetic is correspondingly more desirable.
\ansno 18. For positive numbers: shift fraction left until $f↓1
= 1$, then round, then if the fraction is zero (rounding overflow)
shift it right again. For negative numbers: shift fraction left
until $f↓1 = 0$, then round, then if the fraction is zero (rounding
underflow) shift it right again.
\ansno 19. $\biglp 43 + (1$ if $e↓v < e↓u) - (1$ if fraction overflow$)
- (10$ if result zero$) + (4$ if magnitude
is rounded up$) + (1$ if first rounding digit is $b/2) + (5$ if
rounding digits are $b/2\ 0 \ldotsm 0) + (7$ if rounding overflow$)
+ 7N + A(-1 + (11$ if $N > 0))\bigrp u$, where $N$ is the number
of left shifts during normalization, $A = 1$ if rX receives
nonzero digits (otherwise $A = 0$). The maximum time of $73u$
occurs for example when $$u = +50\ 01\ 00\ 00\ 00,\quad v = -46\ 49\ 99\
99\ 99,\quad b = 100.$$ [The average time, considering the
data in Section 4.2.4, will be about $45{1\over 2}u$.]
%folio 738 galley 3b (C) Addison-Wesley 1978 *
\ansbegin{4.2.2}
\ansno 1. $u \ominus v = u \oplus -v
= -v \oplus u = -(v \oplus -u) = -(v \ominus u)$.
\ansno 2. $u \oplus x ≥ u \oplus 0 = u$, by (8), (2), (6); hence
by (8) again, $(u \oplus x) \oplus v ≥ u \oplus v$. Similarly,
(8) and (6) together with (2) imply that $(u \oplus x) \oplus
(v \oplus y) ≥ (u \oplus x) \oplus v$.
\ansno 3. $u = 8.0000001$, $v = 1.2500008$, $w = 8.0000008$;
$(u \otimes v) \otimes w = 80.000064$, $u \otimes (v \otimes w)=
80.000057$.
\ansno 4. Yes; let $1/u \approx v = w$, where $v$
is large.
\ansno 5. Not always; in decimal arithmetic take $u = v = 9$.
\ansno 6. (a)\9 Yes.\xskip (b) Only for $b + p ≤ 4$ (try
$u = 1 - b↑{-p}$).\xskip [W. M. Kahan observes that the identity {\sl
does} hold whenever $b↑{-1} ≤ f↓u ≤ b↑{-1/2}$. It follows that
$1 \odiv \biglp 1 \odiv (1 \odiv u)\bigrp = 1 \odiv
u$ for all $u$.]
\def\expcirc#1{{\hbox to 9pt{\hfill
\hbox to 0pt{\hskip 0pt minus 100pt\raise 5.0833pt
\hbox{\:@\char'142}\hskip 0pt minus 100pt}\!
\hbox to 0pt{\hskip 0pt minus 100pt\:e#1\hskip 0pt minus 100pt}\hfill}}}
\ansno 7. If $u$ and $v$ are consecutive floating binary numbers,
$u \oplus v = 2u$ or $2v$. When it is $2v$ we often have $u↑\expcirc2\oplus
v↑\expcirc2<2v↑\expcirc2$. For example,
$u = (.10 \ldotsm 001)↓2$, $v = (.10 \ldotsm 010)↓2$, $u \oplus v = 2v$,
and $u↑\expcirc2+v↑\expcirc2=(.10 \ldotsm 011)↓2$.
\ansno 8. (a)\9 $~$, $\approx$;\xskip (b) $~$, $\approx$;\xskip (c)
$~$, $\approx$;\xskip (d) $~$;\xskip (e) $~$.
\ansno 9. $|u - w| ≤ |u - v| + |v - w| ≤ ε↓1 \min(b↑{e↓u
-q}, b↑{e↓w-q}) + ε↓2 \min(b↑{e↓v-q}, b↑{e↓w
-q}) ≤ ε↓1b↑{e↓u-q}+ε↓2b↑{e↓w-q} ≤ (ε↓1
+ ε↓2)\max(b↑{e↓u-q}, b↑{e↓w-q})$. The result
cannot be strengthened in general, since for example we might
have $e↓u$ very small compared to both $e↓v$ and $e↓w$, and
this means that $u - w$ might be fairly large under the hypotheses.
\ansno 10. We have $(.a↓1 \ldotsm a↓{p-1}a↓p)↓b \otimes
(.9 \ldotsm 99)↓b = (.a↓1 \ldotsm a↓{p-1}(a↓p - 1))↓b$ if $a↓p
≥ 1$; here ``9'' stands for $b - 1$. Furthermore, $(.a↓1 \ldotsm
a↓{p-1}a↓p)↓b \otimes (1.0 \ldotsm 0)↓b = (.a↓1 \ldotsm a↓{p-1}0)↓b$, so
the multiplication is not monotone if $b>2$ and $a↓p≥2$.
But when $b = 2$, this argument
can be extended to show that multiplication {\sl is} monotone; obviously
the ``certain computer'' had $b > 2$.
\ansno 11. Without loss of generality, let $x$ be an integer,
$|x| < b↑p$. If $e ≤ 0$ then $t = 0$. If $0 < e ≤ p$ then $x
- t$ has at most $p + 1$ digits, the least significant being
zero. If $e > p$ then $x - t = 0$.\xskip [The result holds also under
the weaker hypothesis $|t| < 2b↑e$.]
\def\dprime{↑{\prime\prime}}
\ansno 12. Assume that $e↓u = p$, $e↓v ≤ 0$, $u > 0$. Case 1, $u
> b↑{p-1}$. Case (1b), $w = u$, $|v| ≤ {1\over 2}$. Then $u↑\prime
= u$, $v↑\prime = 0$, $u\dprime = u$, $v\dprime = 0$. If $|v| = {1\over
2}$ and more general rounding is permitted we might also have
$u↑\prime = u \pm 1$, $v\dprime = \mp1$. Case (1c), $w = u - 1$, $v ≤ -{1\over 2}$,
$e↓v = 0$. Then $u↑\prime = u$ or $u - 1$, $v↑\prime = -1$, $u\dprime
= u$, $v\dprime = -1$ or 0. Case 2, $u = b↑{p-1}$. Case (2a),
$w = u + 1$, $v ≥ {1\over 2}$, $e↓v = 0$. Like (1a). Case (2b),
$w = u$, $|v| ≤ {1\over 2}$, $u↑\prime ≥ u$. Like (1b). Case (2c),
$w = u$, $|v| ≤ {1\over 2}$, $u↑\prime < u$. Then $u↑\prime = u
- j/b$ where $v = j/b + v↓1$ and $|v↓1| ≤ {1\over 2}b↑{-1}$ for
some positive integer $j ≤ {1\over 2}b$; we have $v↑\prime =
0$, $u\dprime = u$, $v\dprime = j/b$. Case (2d), $w < u$. Then $w
= u - j/b$ where $v = -j/b + v↓1$ and $|v↓1| ≤ {1\over 2}b↑{-1}$
for some positive integer $j ≤ b$; we have $(v↑\prime , u\dprime
) = (-j/b, u)$, and $(u↑\prime , v\dprime ) = (u, -j/b)$ or
$\biglp u - 1/b, (1 - j)/b\bigrp $, the latter case only when $v↓1
= {1\over 2}b↑{-1}$. In all cases $u \ominus u↑\prime = u -
u↑\prime$, $v \ominus v↑\prime = v - v↑\prime$, $u \ominus u↑\prime
= u - u\dprime$, $v \ominus v\dprime = v - v\dprime $, round$(w
- u - v) = w - u - v$.
\ansno 13. Since round$(x) = 0$ iff $x = 0$, we want to find a large set of
integer pairs $(m, n)$ with the property that $m \odiv n$ is
an integer iff $m/n$ is. Assume that $|m|, |n| < b↑p$. If $m/n$
is an integer then $m \odiv n = m/n$ is also. Conversely if
$m/n$ is not an integer, but $m\odiv n$ is, we have $1/|n|
≤ |m\odiv n - m/n| < {1\over 2} |m/n| b↑{1-p}$, hence $|m|
> 2b↑{p-1}$. Our answer is therefore to require $|m| ≤ 2b↑{p-1}$
and $0 < |n| < b↑p$.\xskip (Slightly weaker hypotheses are also possible.)
\ansno 14. $|(u \otimes v) \otimes w - uvw| ≤ |(u \otimes v)
\otimes w - (u \otimes v)w| + |w|\,|u \otimes v - uv| ≤ \delta
↓{(u\otimes v)\otimes w} + b↑{e↓w-q-l↓w}\delta ↓{u\otimes v}
≤ (1 + b) \delta ↓{(u\otimes v)\otimes w}$. Now $|e↓{(u\otimes v)\otimes w}
- e↓{u\otimes(v\otimes w)}| ≤ 2$, so we may take $ε = {1\over 2}(1
+ b)b↑{2-p}$.
\ansno 15. $u ≤ v$ implies that $(u \oplus u) \odiv 2 ≤ (u
\oplus v) \odiv 2 ≤ (v \oplus v) \odiv 2$, so the condition
holds for all $u$ and $v$ iff it holds whenever $u = v$. For
base $b = 2$, the condition is therefore always satisfied (barring
overflow); but for $b > 2$ there are numbers $v ≠ w$ such that
$v \oplus v = w \oplus w$, hence the condition fails.\xskip [On the
other hand, the formula $u \oplus \biglp (v \ominus u) \odiv
2\bigrp$ does give a midpoint in the correct range. {\sl Proof:
}It suffices to show that $u + (v \ominus u) \odiv 2 ≤
v$, i.e., $(v \ominus u) \odiv 2 ≤ v - u$; and it is easy
to verify that round$\biglp{1\over 2}\hbox{round}(x)\bigrp ≤ x$
for all $x ≥ 0$.]
\ansno 16. (a)\9 Exponent changes occur at $\sum ↓{10} = 11.111111$,
$\sum ↓{91} = 101.11111$, $\sum ↓{901} = 1001.1102$, $\sum ↓{9001}
= 10001.020$, $\sum ↓{90009} = 100000.91$, $\sum ↓{900819} = 1000000.0$;
therefore $\sum ↓{1000000} = 1109099.1$.
\vskip 2pt
(b)\9 $\sum ↓{1≤k≤n} 1.2345679
= 1224782.1$, and (14) tries to take the square root of $-.0053187053$.
But (15) and (16) are exact in this case.\xskip $\biglp$If $x↓k = 1 + \lfloor
(k - 1)/2\rfloor 10↑{-7}$, (15) and (16) have errors of order
$n$.$\bigrp$
(c)\9 We need to show that $u \oplus \biglp(v \ominus
u) \odiv k\bigrp$ lies between $u$ and $v$; see exercise↔15.
\mixans 17. {⊗FCMP⊗STJ⊗9F⊗Floating point comparison subroutine:\cr
⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
⊗⊗STA⊗TEMP\cr
⊗⊗LDAN⊗TEMP⊗$v ← -v$.\cr
\noalign{\vskip 3pt
\hbox{\quad (Copy here lines 07-20 of Program 4.2.1A.)}\vskip3pt}
⊗⊗LDX⊗FV(0:0)⊗Set rX to zero with sign of $f↓v$.\cr
\\⊗⊗DEC1⊗5\cr
⊗⊗J1N⊗*+2\cr
⊗⊗ENT1⊗0⊗Replace large difference in exponents\cr
⊗⊗SRAX⊗5,1⊗\qquad by a smaller one.\cr
\\⊗⊗ADD⊗FU⊗$\rA ← \null$difference of operands.\cr
⊗⊗JOV⊗7F⊗Fraction overflow: not $~$.\cr
⊗⊗CMPA⊗EPSILON(1:5)\cr
⊗⊗JG⊗8F⊗Jump if not $~$.\cr
⊗⊗JL⊗6F⊗Jump if $~$.\cr
\\⊗⊗JXZ⊗9F⊗Jump if $~$.\cr
⊗⊗JXP⊗1F⊗If $|\rA| = ε$, check sign of $\rA \times \rX$.\cr
\\⊗⊗JAP⊗9F⊗Jump if $~$. $(\rA ≠ 0)$\cr
⊗⊗JMP⊗8F\cr
\\⊗7H⊗ENTX⊗1\cr
⊗⊗SRC⊗1⊗Make rA nonzero with same sign.\cr
⊗⊗JMP⊗8F\cr
\\⊗1H⊗JAP⊗8F⊗Jump if not $~$. $(\rA ≠ 0)$\cr
\\⊗6H⊗ENTA⊗0\cr
⊗8H⊗CMPA⊗=0=⊗Set comparison indicator.\cr
⊗9H⊗JMP⊗*⊗Exit from subroutine.\quad\blackslug\cr}
%folio 742 galley 4a (C) Addison-Wesley 1978 *
\ansno 19. Let $\gamma ↓k = \delta ↓k = \eta ↓k = \sigma
↓k = 0$ for $k > n$. It suffices to find the coefficient of
$x↓1$, since the coefficient of $x↓k$ will be the same except
with all subscripts increased by $k - 1$. Let $(f↓k, g↓k)$ denote
the coefficient of $x↓1$ in $(s↓k - c↓k, c↓k)$ respectively.
Then $f↓1 = (1 + \eta ↓1)(1 - \gamma ↓1 - \gamma ↓1\delta ↓1
- \gamma ↓1\sigma ↓1 - \delta ↓1\sigma ↓1 - \gamma ↓1 \delta
↓1\sigma ↓1)$, $g↓1 = (1 + \delta ↓1)(1 + \eta ↓1)(\gamma ↓1 +
\sigma ↓1 + \gamma ↓1\sigma ↓1)$, and $f↓k = (1 - \gamma ↓k\sigma
↓k - \delta ↓k\sigma ↓k - \gamma ↓k\delta ↓k\sigma ↓k)f↓{k-1}
+ (\gamma ↓k - \eta ↓k + \gamma ↓k \delta ↓k + \gamma ↓k\eta
↓k + \gamma ↓k \delta ↓k\eta ↓k + \gamma ↓k\eta ↓k\sigma ↓k
+ \delta ↓k\eta ↓k\sigma ↓k + \gamma ↓k \delta ↓k\eta ↓k\sigma
↓k)g↓{k-1}$, $g↓k = \sigma ↓k(1 + \gamma ↓k)(1 + \delta ↓k)f↓{k-1}
- (1 + \delta ↓k)(\gamma ↓k + \gamma ↓k\eta ↓k + \eta ↓k\sigma
↓k + \gamma ↓k\eta ↓k\sigma ↓k)g↓{k-1}$, for $1 < k ≤ n$. Thus
$f↓n = 1 + \eta ↓1 - \gamma ↓1 + (4n$ terms of 2nd order$) +
($higher order terms$) = 1 + \eta ↓1 - \gamma ↓1 + O(nε↑2)$ is
sufficiently small.\xskip [The Kahan summation formula was first published
in {\sl CACM \bf 8} (1965), 40; cf.\ {\sl Proc.\ IFIP Congress}
(1971), {\bf 2}, 1232. For another approach to accurate summation,
see R. J. Hanson, {\sl CACM \bf 18} (1975), 57--58. See also G. Bohlender,
{\sl IEEE Trans.\ \bf C--26} (1977), 621--632, for algorithms that compute
round$(x↓1+\cdots+x↓n)$ and round$(x↓1\ldotsm x↓n)$ {\sl exactly}, given
$\{x↓1,\ldotss,x↓n\}$.]
\ansno 20. By the proof of Theorem↔C\null, (47) fails for $e↓w =
p$ only if $|v| + {1\over 2} ≥ |w - u| ≥ b↑{p-1} +
b↑{-1}$; hence $|f↓u| ≥ |f↓v| ≥ 1 - ({1\over 2}b - 1)b↑{-p}$.
This rather rare case, in which $|f↓w|$ before normalization
takes its maximum value 2, is necessary and sufficient for failure.
\ansno 21. (Solution by G. W. Veltkamp.)\xskip Let $c = 2↑{\lceil p/2\rceil}
+ 1$; we may assume that $p ≥ 2$, so $c$ is representable.
First compute $u↑\prime = u \otimes c$, $u↓1 = (u \ominus u↑\prime
) \oplus u↑\prime$, $u↓2 = u \ominus u↓1$; similarly, $v↑\prime = v \otimes
c$, $v↓1 = (v \ominus v↑\prime ) \oplus v↑\prime$, $v↓2 = v \ominus
v↓1$. Then set $w ← u \otimes v$, $w↑\prime ← \biglp ((u↓1 \otimes
v↓1\ominus w) \oplus (u↓1 \otimes v↓2)) \oplus (u↓2 \otimes v↓1)\bigrp
\oplus (u↓2 \otimes v↓2)$.
It suffices to prove this when $u, v > 0$ and
$e↓u = e↓u = p$, so that $u$ and $v$ are integers $\in [2↑{p-1},
2↑p)$. Then $u = u↓1 + u↓2$ where $2↑{p-1} ≤ u↓1 ≤ 2↑p$, $u↓1
\mod 2↑{\lceil p/2\rceil}= 0$, and $|u↓2| ≤ 2↑{\lceil p/2\rceil-1}$;
similarly $v = v↓1 + v↓2$. The operations during the calculation
of $w↑\prime$ are exact, because $w - u↓1v↓1$ is a multiple
of $2↑{p-1}$ with $|w - u↓1v↓1| ≤ |w - uv| + |u↓2v↓1 + u↓1v↓2
+ u↓2v↓2| ≤ 2↑{p-1} + 2↑{p+\lceil p/2\rceil}+ 2↑{p-1}$;
and $|w - u↓1v↓1 - u↓1v↓2| ≤ |w - uv| + |u↓2v| < 2↑{p-1} +
2↑{\lceil p/2\rceil-1+p}$, where $w - u↓1v↓1 - u↓1v↓2$ is a multiple
of $2↑{\lceil p/2\rceil}$.
\ansno 22. We may assume that $b↑{p-1} ≤ u, v < b↑p$.
If $uv ≤ b↑{2p-1}$ then $x↓1 = uv - r$ where $|r| ≤ {1\over
2}b↑{p-1}$, hence $x↓2 =\hbox{round}(u - r/v) = x↓0$ (since $|r/v|
≤ {1\over 2}b↑{p-1}/b↑{p-1} ≤ {1\over 2}$, and equality implies
$v = b↑{p-1}$ hence $r = 0$). If $uv > b↑{2p-1}$ then $x↓1 =
uv - r$ where $|r| ≤ {1\over 2}b↑p$, hence $x↓1/v ≤ u - r/v
< b↑p + {1\over 2}b$ and $x↓2 ≤ b↑p$. If $x↓2 = b↑p$ then $x↓3
= x↓1$ $\biglp$for otherwise $(b↑p - {1\over 2})v ≤ x↓1 ≤ b↑p(v - 1)\bigrp$.
If $x↓2 < b↑p$ and $x↓1 > b↑{2p-1}$ then let $x↓2 = x↓1/v +
q$ where $|q| ≤ {1\over 2}$; we have $x↓3 =\hbox{round}(x↓1+
qv) = x↓1$. Finally if $x↓2 < b↑p$ and $x↓1 = b↑{2p-1}$ and
$x↓3 < b↑{2p-1}$ then $x↓4 = x↓2$ by the first case above. This
situation arises e.g.\ for $b = 10$, $p = 2$, $u = 19$, $v = 55$, $x↓1
= 1000$, $x↓2 = 18$, $x↓3 = 990$.
\def\omod{\hbox to 25pt{\hfill$\vcenter{\hbox{\:@\char'140}}$\hfill}\hskip-25pt
\raise .3pt\hbox to 25pt{\hfill$\vcenter{\moveleft .2pt\hbox{\:emod}}$\hfill}}
\ansno 23. Let $\lfloor u\rfloor = n$, so that
$u\omod 1 = u \ominus n = u - n + r$ where $|r| ≤ {1\over 2}b↑{-p}$;
we wish to show that round$(n - r) = n$. The result is clear
if $|n| > 1$; and $r = 0$ when $n = 0$ or 1, so the only subtle
case is when $n = -1$, $r = -{1\over 2}b↑{-p}$. The identity fails
iff $b$ is a multiple of 4 and $-b↑{-1} < u < -b↑{-2}$ and $u
\mod 2b↑{-p} = {3\over 2}b↑{-p}$ $\biglp$e.g., $p = 3$, $b = 8$,
$u = -(.0124)↓8\bigrp$.
\def\upper#1{\mathop{\hbox to 8.889pt{\:u\char'156\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\def\lwr#1{\mathop{\hbox to 8.889pt{\:u\char'157\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\ansno 24. Let $u = [u↓{\,l}, u↓r]$, $v = [v↓{\,l}, v↓r]$. Then $u \oplus v =
[u↓{\,l}\lwr+v↓{\,l},u↓r\upper+v↓r]$,
where $x\upper+y=y\upper+x$, $x \upper+
+0 = x$ for all $x$, $x\upper+ - 0 = x$ for all
$x ≠ +0$, $x\upper+ +∞ = +∞$ for all $x≠-∞$, and
$x \upper+ -∞$ needn't be defined; $x \lwr+
y = -\biglp (-x) \upper+ (-y)\bigrp$.
If $x\oplus y$ would overflow in normal floating point arithmetic because
$x+y$ is too large, then $x\upper+y$ is $+∞$ and
$x\lwr+y$ is the largest representable number.
For subtraction, let $u \ominus v = u \oplus (-v)$, where $-v = [-v↓r, -v↓{\,l}]$.
Multiplication is somewhat more complicated. The correct procedure is to
let $$u \otimes v = \hbox{\:a[}\min(u↓{\,l}\lwr\times v↓{\,l},u↓{\,l}\lwr\times v↓r,u↓r
\lwr\times v↓{\,l},u↓r\lwr\times v↓r),\;\max(u↓{\,l}\upper\times v↓{\,l},
u↓{\,l}\upper\times v↓r,u↓r\upper\times v↓{\,l}, u↓r\upper\times v↓r)\hbox{\:a]},$$where
$x\upper\times y=y\upper\times x$, $x\upper\times(-y)=-(x\lwr\times y)=
(-x)\upper\times y$; $x\upper\times +0=(+0$ for $x>0$, $-0$ for $x<0)$;
$x\upper\times -0=-(x\upper\times+0)$; $x\upper\times+∞=(+∞$ for
$x>+0$, $-∞$ for $x<-0)$.\xskip $\biglp$It
is possible to determine the min and max simply
by looking at the signs of $u↓{\,l}$, $u↓r$, $v↓{\,l}$, and $v↓r$, thereby computing
only two of the eight products, except when $u↓{\,l}<0<u↓r$ and $v↓{\,l}<0<v↓r$; in the
latter case we compute four products, and the answer is $[\min(u↓{\,l}\lwr\times v↓r,
u↓r\lwr\times
v↓{\,l}),\max(u↓{\,l}\upper\times v↓{\,l}, u↓r\upper\times v↓r)]$.$\bigrp$
Finally, $u \odiv v$ is undefined if $v↓{\,l} < 0
< v↓r$; otherwise we use the formulas for multiplication with
$v↓{\,l}$ and $v↓r$ replaced respectively by $v↓r↑{-1}$ and $v↓{\,l}↑{-1}$,
where $x \upper\times y↑{-1}=x\upper/y$,
$x\lwr\times y↑{-1}=x\lwr/y$, $(\pm0)↑{-1}=\pm∞$, $(\pm∞)↑{-1}=\pm0$.
$\biglp$Cf.\
E. R. Hansen, {\sl Math.\ Comp.\ \bf22} (1968), 374--384. An alternative
scheme, in which division by 0 gives no error messages and intervals may be
neighborhoods of $∞$, has been proposed by W. M. Kahan. In Kahan's scheme, for
example, the reciprocal of $[-1,+1]$ is $[+1,-1]$, and an attempt to multiply an
interval containing 0 by an interval containing $∞$ yields $[-∞,+∞]$, the
set of all numbers. See {\sl Numerical Analysis}, Univ.\ Michigan Engineering
Summer Conf.\ Notes No.\ 6818 (1968).$\bigrp$
\ansno 25. Cancellation reveals {\sl previous} errors in the
computation of $u$ and $v$. For example, if $ε$ is small, we
often get poor accuracy when computing $f(x + ε) \ominus f(x)$,
because the rounded calculation of $f(x + ε)$ destroys much
of the information about $ε$. It is desirable to rewrite
such formulas as $ε \otimes g(x, ε)$, where $g(x, ε) = \biglp
f(x + ε) - f(x)\bigrp/ε$ is first computed symbolically.
Thus, if $f(x) = x↑2$ then $g(x, ε) = 2x + ε$; if $f(x) =
\sqrt{x}$ then $g(x, ε) = 1/(\sqrt{x
+ ε} + \sqrt{x}\,)$.
\ansno 26. See {\sl Math.\ Comp.\ \bf32} (1978), 227--232.
%folio 745 galley 4b (C) Addison-Wesley 1978 *
\ansbegin{4.2.3}
\ansno 1. First, $(w↓m, w↓{\,l}) = (.573,
.248)$; then $w↓mv↓{\,l}/v↓m = .290$; so the answer is $(.572, .958)$.
This in fact is the correct result to six decimals.
\ansno 2. The answer is not affected, since the normalization
routine truncates to eight places and can never look at this
particular byte position.\xskip (Scaling to the left occurs at most
once during normalization, since the inputs are normalized.)
\ansno 3. Overflow obviously cannot occur at line 09, since
we are adding two-byte quantities, or at line 22, since we are
adding four-byte quantities. In line 30 we are computing the
sum of three four-byte quantities, so this cannot overflow.
Finally, in line 32, overflow is impossible because the product
$f↓uf↓v$ must be less than unity.
\ansno 4. Insert ``\.{JOV OFLO; ENT1 0}'' between lines 03 and 04.
Replace lines 21--22 by ``\.{ADD TEMP(ABS);} \.{JNOV *+2;} \.{INC1 1}'',
and change lines 28--31 to ``\.{SLAX 5;} \.{ADD TEMP;} \.{JNOV *+2;} \.{INC1
1;} \.{ENTX 0,1;} \.{SRC 5}''. This adds five lines of code and only
1, 2, or 3 units of execution time.
\ansno 5. Insert ``\.{JOV OFLO}'' after line 06. Change lines 22,
31, 39 respectively to ``\.{SRAX 0,1}'', ``\.{SLAX 5}'', ``\.{ADD ACC}''.
Between lines
40 and 41, insert ``\.{DEC2 1;} \.{JNOV DNORM;} \.{INC2 1;} \.{INCX 1;} \.{SRC
1}''.\xskip(It's tempting to remove the ``\.{DEC 1}'' in favor of ``\.{STZ EXPO}'',
but then ``\.{INC2 1}'' might overflow rI2!)\xskip This adds six lines of
code; the running time {\sl decreases} by $3u$, unless there is fraction
overflow, when it increases by $7u$.
\mixans 6. {⊗DOUBLE⊗STJ⊗EXITDF⊗Convert to double precision:\cr
⊗⊗ENTX⊗0⊗Clear rX.\cr
⊗⊗STA⊗TEMP\cr
\\⊗⊗LD2⊗TEMP(EXP)⊗$\rI2 ← e$.\cr
⊗⊗INC2⊗QQ-Q⊗Correct for difference in excess.\cr
\\⊗⊗STZ⊗EXPO⊗$\.{EXPO} ← 0$.\cr
⊗⊗SLAX⊗1⊗Remove exponent.\cr
⊗⊗JMP⊗DNORM⊗Normalize and exit.\cr
\noalign{\yskip}
⊗SINGLE⊗STJ⊗EXITF⊗Convert to single precision:\cr
⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
\\⊗⊗STA⊗TEMP\cr
⊗⊗LD2⊗TEMP(EXPD)⊗$\rI2 ← e$.\cr
⊗⊗DEC2⊗QQ-Q⊗Correct for difference in excess.\cr
\\⊗⊗SLAX⊗2⊗Remove exponent.\cr
⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\quad\blackslug\cr}
\ansno 7. All three routines give zero as the answer if
and only if the exact result would be zero, so we need not worry
about zero denominators in the expressions for relative error.
The worst case of the addition routine is pretty bad: Visualized
in decimal notation, if the inputs are 1.0000000 and .99999999,
the answer is $b↑{-7}$ instead of $b↑{-8}$; thus the maximum
relative error $\delta ↓1$ is $b - 1$, where $b$ is the byte
size.
For multiplication and division, we may assume that the
exponents of both operands are \.{QQ}, and that both operands are
positive. The maximum error in multiplication is readily bounded
by considering Fig.↔4: When $uv ≥ 1/b$, we have $0 ≤ uv - u \otimes
v < 3b↑{-9} + (b - 1)b↑{-9}$, so the relative error is bounded
by $(b + 2)b↑{-8}$. When $1/b↑2 ≤ uv < 1/b$, we have $0 ≤ uv
- u \otimes v < 3b↑{-9}$, so the relative error in this case
is bounded by $3b↑{-9}/uv ≤ 3b↑{-7}$. We take $\delta ↓2$ to
be the larger of the two estimates, namely $3b↑{-7}$.
Division requires a more careful analysis of Program↔D\null.
The quantity actually computed by the subroutine is $α -
\delta - bε\biglp (α - \delta↑{\prime\prime} )(β - \delta ↑\prime )
- \delta↑{\prime\prime\prime} \bigrp - \delta ↓n$ where $α = (u↓m + εu↓{\,l})/bv↓m$,
$β = v↓{\,l}/bv↓m$, and the nonnegative truncation errors
$(\delta , \delta ↑\prime , \delta↑{\prime\prime},
\delta↑{\prime\prime\prime})$ are respectively less than
$(b↑{-10}, b↑{-5}, b↑{-5}, b↑{-6})$; finally
$\delta ↓n$ (the truncation during
normalization) is nonnegative and less than either
$b↑{-9}$ or $b↑{-8}$, depending
on whether scaling occurs or not. The actual value of the quotient
is $α/(1 + bεβ) = α - bεαβ + b↑2αβ↑2\delta↑{\prime\prime\prime\prime}
$, where $\delta↑{\prime\prime\prime\prime}$ is the nonnegative error due
to truncation of the infinite series (2); here $\delta↑{\prime\prime\prime\prime}
< ε↑2=b↑{-10}$, since it is an alternating series. The relative error
is therefore the absolute value of $(bε\delta ↑\prime + bε\delta↑{\prime\prime}
β/α + bε\delta↑{\prime\prime\prime}/α) - (\delta /α + bε\delta ↑\prime
\delta↑{\prime\prime}/α + b↑2β↑2\delta↑{\prime\prime\prime\prime}+ \delta ↓n/α)$,
times $(1 + bεβ)$. The positive terms in this expression are
bounded by $b↑{-9} + b↑{-8} + b↑{-8}$, and the negative terms
are bounded by $b↑{-8} + b↑{-12} + b↑{-8}$ plus the contribution
by the normalizing phase, which can be about $b↑{-7}$ in magnitude.
It is therefore clear that the potentially greatest part of
the relative error comes during the normalization phase, and
that $\delta ↓3 = (b + 2)b↑{-8}$ is a safe upper bound for
the relative error.
\ansno 8. Addition: If $e↓u ≤ e↓v + 1$, the entire relative
error occurs during the normalization phase, so it is bounded
above by $b↑{-7}$. If $e↓u ≥ e↓v + 2$, and if the signs are
the same, again the entire error may be ascribed to normalization;
if the signs are opposite, the error due to shifting digits
out of the register is in the opposite direction from the subsequent
error introduced during normalization. Both of these errors
are bounded by $b↑{-7}$, hence $\delta ↓1 = b↑{-7}$.\xskip (This is
substantially better then the result in exercise↔7.)
Multiplication: An analysis as in exercise↔7 gives $\delta↓2=(b+2)b↑{-8}$.
%folio 748 galley 5 (C) Addison-Wesley 1978 *
\ansbegin{4.2.4}
\ansno 1. Since fraction overflow can
occur only when the operands have the same sign, this is the
probability that fraction overflow occurs divided by the probability
that the operands have the same sign, namely, $7\%/\biglp{1\over 2}(91\%)\bigrp
\approx 15\%$.
\ansno 3. $\log↓{10} 2.4 - \log↓{10} 2.3 \approx 1.84834\%$.
\ansno 4. The pages would be uniformly gray (same as ``random
point on a slide rule'').
\ansno 5. The probability that $10f↓U ≤ r$ is $(r - 1)/10 +
(r - 1)/100 + \cdots = (r - 1)/9$. So in this case the leading
digits are {\sl uniformly} distributed; e.g., leading digit
1 occurs with probability ${1\over 9}$.
\ansno 6. The probability that there are three leading zero
bits is $\log↓{16} 2 = {1\over 4}$; the probability that there
are two leading zero bits is $\log↓{16} 4 - \log↓{16} 2 = {1\over
4}$; and similarly for the other two cases. The ``average'' number
of leading zero bits is 1$1\over2$, so the ``average'' number
of ``significant bits'' is $p + {1\over 2}$. The worst case,
$p - 1$ bits, occurs however with rather high probability. In
practice, it is usually necessary to base error estimates on
the worst case, since a chain of calculations is only as strong as its weakest link.
In the error analysis of Section 4.2.2, the
upper bound on relative rounding error for floating hex is $2↑{1-p}$.
In the binary case we can have $p + 1$ significant bits at all
times (cf.\ exercise 4.2.1--3), with relative rounding errors
bounded by $2↑{-1-p}$. Extensive computational experience confirms
that floating binary (even with $p$-bit precision instead of
$p + 1$) produces significantly more accurate results than $(p
+ 2)$-bit floating hex.
Tables 1 and 2 show that hexadecimal arithmetic
can be done a little faster, since fewer cycles are needed when
scaling to the right or normalizing to the left. But this fact
is insignificant compared to the substantial advantages of $b
= 2$ over other radices (cf.\ also Theorem 4.2.2C and exercises
4.2.2--15,↔21), especially since floating binary can be made
as fast as floating hex with only a tiny increase in total processor
cost.
\ansno 7. For example, suppose that $\sum ↓m \biglp F(10↑{km}
\cdot 5↑k) - F(10↑{km})\bigrp = \log 5↑k/\log 10↑k$ and also that $\sum
↓m \biglp (F(10↑{km} \cdot 4↑k) - F(10↑{km})\bigrp = \log 4↑k/\log
10↑k$; then
$$\chop to 9pt{\sum ↓{m}\,\biglp F(10↑{km} \cdot 5↑k) - F(10↑{km} \cdot 4↑k)\bigrp
= \log↓{10} {5\over 4}}$$
for all $k$. But now let $ε$ be a small positive
number, and choose $\delta > 0$ so that $F(x) < ε$ for $0 < x
< \delta $, and choose $M > 0$ so that $F(x) > 1-ε$ for $x
> M$. We can take $k$ so large that $10↑{-k} \cdot 5↑k < \delta$
and $4↑k > M$; hence by the monotonicity of $F$,
$$\eqalign{\sum ↓{m}\, \biglp F(10↑{km} \cdot 5↑k) - F(10↑{km} \cdot 4↑k)\bigrp
⊗ ≤ \sum ↓{m≤0}\, \biglp F(10↑{km} \cdot 5↑k) - F(10↑{k(m-1)}
\cdot 5↑k)\bigrp \cr
⊗\qquad\null + \sum ↓{m≥0}\, \biglp F(10↑{k(m+1)}
\cdot 4↑k) - F(10↑{km} \cdot 4↑k)\bigrp \cr
⊗= F(10↑{-k}5↑k) + 1 - F(10↑k4↑k) < 2ε.\cr}$$
\ansno 8. When $s > r$, $P↓0(10↑ns)$ is 1 for small $n$,
and 0 when $\lfloor 10↑ns\rfloor > \lfloor 10↑nr\rfloor $. The
least $n$ for which this happens may be arbitrarily large, so
no uniform bound can be given for $N↓0(ε)$ independent of $s$.\xskip
(In general, calculus textbooks prove that such a uniform bound
would imply that the limit function $S↓0(s)$ would be continuous,
and it isn't.)
\ansno 9. Let $q↓1$, $q↓2$, $\ldots$ be such that $P↓0(n) = q↓1{n-1\choose0}
+ q↓2{n-1\choose1} + \cdots$ for all $n$. It follows that
$P↓m(n) = 1↑{-m}q↓1{n-1\choose0} + 2↑{-m}q↓2{n-1\choose1}
+\cdots$ for all $m$ and $n$.
\ansno 10. When $1 < r < 10$ the generating function $C(z)$
has simple poles at the points $1 + w↓n$, where $w↓n = 2πni/\ln 10$,
hence
$$\chop to 9pt{C(z) = {\log↓{10} r - 1\over 1 - z} + \sum ↓{n≠0}
{1 + w↓n\over w↓n} {e↑{-w↓n\ln r} - 1\over (\ln 10)(z
- 1 - w↓n)} + E(z)}$$
where $E(z)$ is analytic in the entire plane. Thus
if $\theta = \hbox{arctan}(2π/\ln 10)$,
$$\eqalign{c↓m ⊗ =\log↓{10} r - 1 - {2\over \ln 10} \sum ↓{n>0}
\real\bigglp{e↑{-w↓n\ln r} - 1\over w↓n(1 + w↓n)↑m}\biggrp+ e↓m\cr
⊗= \log↓{10} r - 1 + {\sin(m\theta + 2π\log↓{10}
r) -\sin(m\theta )\over π\biglp1 + 4π↑2/(\ln 10)↑2\bigrp ↑{m/2}}
+ O\bigglp{1\over \biglp 1 + 16π↑2/(\ln 10)↑2\bigrp ↑{m/2}}\biggrp.\cr}$$
\ansno 11. When $(\log↓b U) \mod 1$ is uniformly distributed
in $[\,0, 1)$, so is $(\log↓b 1/U)\mod 1 = (1 - \log↓b U)\mod 1$.
\ansno 12. We have $h(z) = \int ↑{z}↓{1/b} f(x)\,dx\,g(z/bx)/bx + \int
↑{1}↓{z} f(x)\,dx\,g(z/x)/x$, and the same holds for $l(z) = \int ↑{z}↓{1/b} f(x)
\,dx\,l(z/bx)/bx + \int ↑{1}↓{z} f(x)\,dx\,l(z/x)/x$, hence $$
{h(z) - l(z)\over l(z)} = \int ↑{z}↓{1/b} f(x)\,dx\,{g(z/bx)
- l(z/bx)\over l(z/bx)} + \int ↑{1}↓{z} f(x)\,dx\,{g(z/x)
- l(z/x)\over l(z/x)}.$$ Since $f(x) ≥ 0$, $\hbox{\:u\char'152}
\biglp h(z) - l(z)\bigrp)/l(z)\hbox{\:u\char'152}
≤ \int ↑{z}↓{1/b} f(x)\,dx\,A(g) + \int ↑{1}↓{z} f(x)\,dx\,A(g)$
for all $z$, hence $A(h) ≤ A(g)$. By symmetry, $A(h) ≤ A(f)$.\xskip
[{\sl Bell System Tech.\ J.\ \bf 49} (1970), 1609--1625.]
\ansno 13. Let $X = (\log↓b U)\mod 1$ and $Y = (\log↓b V)\mod
1$, so that $X$ and $Y$ are independently and uniformly distributed
in $[\,0, 1)$. No left shift is needed if and only if $X + Y ≥ 1$,
and this occurs with probability ${1\over 2}$.
(Similarly, the probability is $1\over2$ that floating point division by
Algorithm 4.2.1M needs no normalization shifts;
this needs only the weaker assumption
that both operands independently have the {\sl same} distribution.)
\ansno 14. For convenience, the calculations are shown here
for $b = 10$. If $k = 0$, the probability of a carry is
$$\hbox to size{$\vcenter{
\hbox to 250pt{$\dispstyle\hfill\left(
1\over\ln 10\right)↑2\int ↓{\scriptstyle1≤x,y≤10\atop\scriptstyle x+y≥10} {dx\over
x}\,{dy\over y}.\hfill$}
\vskip 8pt
\hbox{(See Fig.\ A--7.) The value of the integral is}
\vskip 11pt
\hbox to 250pt{$\dispstyle\hfill\int↓0↑{10}{dy\over y}\int↓{10-y}↑{10}{dx\over x}
-2\int↓0↑1{dy\over y}\int↓{10-y}↑{10}{dx\over x},\hfill$}}\hfill
\vcenter{\vskip33mm\hbox to 32mm{\hfill\caption Fig.\ A--7.\hfill\hskip-10pt}}$}$$
and
$$\int ↑{t}↓{0}{dy\over y}\,\ln\left(1\over 1 - y/10\right)=
\int ↑{t}↓{0}\left({1\over 10} + {y\over 200}+ {y↑2\over
3000} + \cdotss\right)\,dy = {t\over 10} + {t↑2\over 400} +
{t↑3\over 9000} + \cdotss.$$
(The latter integral is essentially a ``dilogarithm.'')
Hence the probability of a carry when $k = 0$ is
$(1/\ln 10)↑2(π↑2/6 - 2 \sum ↓{n≥1} 1/n↑210↑n) = .27154$.\xskip [{\sl Note:}
When $b= 2$ and $k = 0$, fraction overflow {\sl
always} occurs, so this derivation proves that $\sum ↓{n≥1}
1/n↑22↑n = π↑2/12 - {1\over 2}(\ln 2)↑2$.]
When $k > 0$, the probability is
$$\left(1\over\ln 10\right)↑2\int ↑{10↑{1-k}}↓{10↑{-k}}\≤\≤{dy\over
y}\int ↑{10}↓{10-y}{dx\over x} = \left(1\over\ln 10\right)↑2\bigglp\sum
↓{n≥1} {1\over n↑210↑{nk}} - \sum ↓{n≥1} {1\over n↑210↑{n(k+1)}}\biggrp.$$
Thus when $b = 10$, fraction overflow should occur
with probability $.272p↓0 + .017p↓1 + .002p↓2 + \cdotss$. When
$b = 2$ the corresponding figures are $p↓0 + .655p↓1 + .288p↓2
+ .137p↓3 + .067p↓4 + \cdotss$.
Now if we use the probabilities from Table↔1,
dividing by .91 to eliminate zero operands and
assuming that the probabilities are independent of the operand
signs, we predict a probability of about 14 percent when $b
= 10$, instead of the 15 percent in exercise↔1. For $b = 2$,
we predict about 48 percent, while the table yields 44 percent.
These results are certainly in agreement within the limits of
experimental error.
\ansno 15. When $k = 0$, the leading digit is 1 if and only
if there is a carry.\xskip (It is possible for fraction overflow and
subsequent rounding to yield a leading digit of 2, when $b ≥
4$, but we are ignoring rounding in this exercise.)\xskip The probability
of fraction overflow is $.272 < \log↓{10} 2$, as shown in the
previous exercise.
When $k > 0$, the leading digit is 1 with probability
$$\left(1\over\ln 10\right)↑2\bigglp\int ↑{10↑{1-k}}↓{10↑{-k}}\≤\≤\≤{dy\over
y}\int ↓{\scriptstyle1≤x<2-y\atop\scriptstyle\hbox{\:eor }10-y≤x<10}
\≤\≤{dx\over x}\biggrp < \left(1\over\ln 10\right)↑2
\bigglp\int ↑{10↑{1-k}}↓{10↑{-k}}\≤\≤\≤{dy\over y}\int ↓{1≤x≤2}\≤{dx\over
x}\biggrp = \log↓{10} 2.$$
\ansno 16. To prove the hint [which is due to Landau,
{\sl Prace matematyczno-fizyczne \bf 21} (1910), 103--113],
assume first that $\limsup a↓n = λ > 0$. Let $ε = λ/(λ + 4M)$
and choose $N$ so that $|a↓1 + \cdots + a↓n| < {1\over 10}ελn$
for all $n > N$. Let $n > N/(1 - ε)$, $n > 5/ε$ be such that $a↓n
> {1\over 2}λ$. Then, by induction, $a↓{n-k} ≥ a↓n - kM/(n - εn)
> {1\over 4}λ$ for $0 ≤ k < εn$, and $\sum ↓{n-εn<k≤n} a↓k ≥
{1\over 4}λ(εn - 1) > {1\over 5}λεn$. But $$\textstyle\left|\sum ↓{n-εn<k≤n}
a↓k\right| = \left|\sum ↓{1≤k≤n} a↓k - \sum ↓{1≤k≤n-εn} a↓k\right| ≤ {1\over
5}ελn$$ since $n - εn > N$. A similar contradiction applies if
$\liminf a↓n < 0$.
Assuming that $P↓{m+1}(n) → λ$ as $n → ∞$, let
$a↓k = P↓m(k) - λ$. If $m > 0$, the $a↓k$ satisfy the hypotheses
of the hint (cf.\ Eq.\ 4.2.2--15),
since $0 ≤ P↓m(k)≤ 1$; hence $P↓m(n) → λ$.
\ansno 17. See {\sl Fibonacci Quarterly \bf 7} (1969), 474--475.\xskip
$\biglp$Persi Diaconis [Ph.D. thesis, Harvard University, 1974] has
shown among other things that the definition of probability
by repeated averaging is weaker than harmonic probability, in
the following precise sense: If $\lim↓{m→∞}\liminf↓{n→∞}
P↓m(n) = \lim↓{m→∞}\limsup↓{n→∞} P↓m(n) = λ$ then
the harmonic probability is $λ$. On the other hand the statement
``$10↑{k↑2}≤ n < 10↑{k↑2+k}$ for some integer
$k > 0$'' has logarithmic probability ${1\over 2}$, while repeated
averaging never settles down to give it any particular probability.$\bigrp$
\ansno 18. Let $p(a)=P(L↓a)$ and $p(a,b)=\sum↓{a≤k<b}p(k)$ for $1≤a<b$.
Since $L↓a=L↓{10a}∪L↓{10a+1}∪\cdots∪L↓{10a+9}$ for all $a$, we have
$p(a)=p\biglp10a,10(a+1)\bigrp$ by (i). Furthermore since $P(S)=P(2S)+P(2S+1)$
by (i), (ii), (iii), we have $p(a)=p\biglp2a,2(a+1)\bigrp$. It follows that
$p(a,b)=p(2↑m10↑na,2↑m10↑nb)$ for all $m,n≥0$.
If $1<b/a<b↑\prime/a↑\prime$ then $p(a,b)≤p(a↑\prime,b↑\prime)$. The reason is
that there exist integers $m$, $n$, $m↑\prime$, $n↑\prime$ such that
$2↑{m↑\prime}10↑{n↑\prime}a↑\prime≤2↑m10↑na<2↑m10↑nb≤2↑{m↑\prime}10↑{n↑\prime}
b↑\prime$ as a consequence of the fact that $\log2/\log10$ is irrational,
hence we can apply (v).\xskip
(Cf.\ exercise 3.5--22 with $k=1$ and $U↓n=n\log2/\log10$.)\xskip
In particular, $p(a)≥p(a+1)$, and it follows that $p(a,b)/p(a,b+1)≥(b-a)/(b+1-a)
$.\xskip (Cf.\ Eq.\ 4.2.2--15.)
Now we can prove that $p(a,b)=p(a↑\prime,b↑\prime)$ whenever $b/a=b↑\prime/a↑\prime
$; for $p(a,b)=p(10↑na,10↑nb)≤c↓np(10↑na,10↑nb-1)≤c↓np(a↑\prime,b↑\prime)$, for
arbitrarily large values of $n$, where
$c↓n=10↑n(b-a)\hbox{\:a/}\biglp10↑n(b-a)-1\bigrp=1+O(10↑{-n})$.
For any positive integer $n$ we have $p(a↑n,b↑n)=p(a↑n,ba↑{n-1})+p(ba↑{n-1},
b↑2a↑{n-2})+\cdots+p(b↑{n-1}a,b↑n)=np(a,b)$. If $10↑m≤a↑n≤10↑{m+1}$ and
$10↑{m↑\prime}≤b↑n≤10↑{m↑\prime+1}$ we have $p(10↑{m+1},10↑{m↑\prime})≤
p(a↑n,b↑n)≤p(10↑m,10↑{m↑\prime+1})$ by (v). But $p(1,10)=1$ by (iv), hence
$p(10↑m,10↑{m↑\prime})=m↑\prime-m$ for all $m↑\prime≥m$. We conclude that
$\lfloor\log↓{10}b↑n\rfloor-\lfloor\log↓{10}a↑n\rfloor-1≤np(a,b)≤\lfloor\log↓{10}
b↑n\rfloor+\lfloor\log↓{10}a↑n\rfloor+1$ for all $n$, and $p(a,b)=\log↓{10}(b/a)$.
[This exercise was inspired by D. I. A. Cohen, who proved a slightly weaker result
in {\sl J. Combinatorial Theory} (A) {\bf20} (1976), 367--370.]
%folio 754 galley 6 (C) Addison-Wesley 1978 *
\ansbegin{4.3.1}
\def\ansalgstep #1 #2. {\anskip\noindent\hbox to 40pt{\!
\hbox to 19pt{\hskip 0pt plus 1000pt minus 1000pt\bf#1 }\hskip 0pt plus 1000pt
\bf#2. }\hangindent 40pt}
\ansno 2. If the $i$th number to be
added is $u↓i = (u↓{i1}u↓{i2} \ldotsm u↓{in})↓b$, use Algorithm↔A
with step A2 changed to the following:
\ansalgstep{} A2\char'43. [Add digits.]
Set$$w↓j ← (u↓{1j} + \cdots + u↓{mj} + k)\mod b,\quad\hbox{and}\quad k ←
\lfloor (u↓{1j} + \cdots + u↓{mj} + k)/b\rfloor.$$
\noindent (The maximum
value of $k$ is $m - 1$, so step A3 would have to be altered
if $m > b$.)
\mixansfour 3. {⊗⊗ENT1⊗N⊗ 1\cr
⊗⊗JOV⊗OFLO⊗ 1⊗Ensure overflow is off.\cr
⊗⊗ENTA⊗0⊗ 1⊗$k ← 0$.\cr
\\⊗2H⊗ENT2⊗0⊗ N⊗$(\rI2 ≡ \hbox{next value of }k)$\cr
⊗⊗ENT3⊗M*N-N,1⊗N⊗$\biglp\.{LOC}(u↓{ij}) ≡ \.U + n(i - 1) + j\bigrp$ \cr
\\⊗3H⊗ADD⊗U,3⊗MN⊗$\rA ← \rA + u↓{ij}$.\cr
⊗⊗JNOV⊗*+2⊗MN\cr
⊗⊗INC2⊗1⊗K⊗Carry one.\cr
\\⊗⊗DEC3⊗N⊗MN⊗Repeat for $m ≥ i ≥ 0$.\cr
⊗⊗J3P⊗3B⊗MN⊗$\biglp\rI3 ≡ n(i - 1) + j\bigrp$ \cr
\\⊗⊗STA⊗W,1⊗N⊗$w↓j ←\rA$.\cr
⊗⊗ENTA⊗0,2⊗N⊗$k ← \rI2$.\cr
\\⊗⊗DEC1⊗1⊗ N\cr
⊗⊗J1P⊗2B⊗N⊗Repeat for $n ≥ j ≥ 0$.\cr
⊗⊗STA⊗W⊗ 1⊗Store final carry in $w↓0$.\quad\blackslug\cr}
\yyskip\noindent Running time, assuming that $K = {1\over 2}MN$, is $5.5MN
+ 7N + 4$ cycles.
\ansno 4. We may make the following assertion before
A1: ``$n ≥ 1$; and $0 ≤ u↓i, v↓i < b$ for $1 ≤ i ≤ n$.'' Before
A2, we assert: ``$1 ≤ j ≤ n$; $0 ≤ u↓i, v↓i < b$ for $1 ≤ i ≤
n$; $0 ≤ w↓i < b$ for $j < i ≤ n$; $0 ≤ k ≤ 1$; and
$$(u↓{j+1} \ldotsm u↓n)↓b + (v↓{j+1} \ldotsm v↓n)↓b = (kw↓{j+1}
\ldotsm w↓n)↓b.''$$
The latter statement means more precisely that
$$\chop to 12pt{\sum ↓{j<t≤n} u↓tb↑{n-t} + \sum ↓{j<t≤n} v↓tb↑{n-t} = kb↑{n-j}
+ \sum ↓{j<t≤n} w↓tb↑{n-t}.}$$
Before A3, we assert: ``$1 ≤ j ≤ n$;
$0 ≤ u↓i, v↓i < b$ for $1 ≤ i ≤ n$; $0 ≤ w↓i < b$ for $j ≤ i ≤
n$; $0 ≤ k ≤ 1$; and $(u↓j \ldotsm u↓n)↓b + (v↓j \ldotsm v↓n)↓b
= (kw↓j \ldotsm w↓n)↓b$.'' After A3, we assert that $0 ≤ w↓i
< b$ for $1 ≤ i ≤ n$; $0 ≤ w↓0 ≤ 1$; and $(u↓1 \ldotsm u↓n)↓b +
(v↓1 \ldotsm v↓n)↓b = (w↓0 \ldotsm w↓n)↓b$.
It is a simple matter to complete the proof by
verifying the necessary implications between the assertions
and by showing that the algorithm always terminates.
\ansalgstep 5. B1. Set $j ← 1$, $w↓0 ← 0$.
\ansalgstep{} B2. Set $t ← u↓j + v↓j$, $w↓j ← t \mod b$,
$i ← j$.
\ansalgstep{} B3. If $t ≥ b$, set $i ← i - 1$, $t ← w↓i
+ 1$, $w↓i ← t \mod b$, and repeat this step until $t < b$.
\ansalgstep{} B4. Increase $j$ by one, and if $j ≤ n$ go
back to B2.\quad\blackslug\lower 5pt\null
\ansalgstep 6. C1. Set $j ← 1$, $i ← 0$, $r ← 0$.
\ansalgstep{} C2. Set $t ← u↓j + v↓j$. If $t ≥ b$, set
$w↓i ← r + 1$, $w↓k ← 0$ for $i < k < j$; then set $i ← j$ and $r
← t \mod b$. Otherwise if $t < b - 1$, set $w↓i ← r$, $w↓k ← b
- 1$ for $i < k < j$; then set $i ← j$ and $r ← t$.
\ansalgstep{} C3. Increase $j$ by one. If $j ≤ n$, go
back to C2; otherwise set $w↓i ← r$, and $w↓k ← b - 1$ for $i
< k ≤ n$.\quad\blackslug\lower 5pt\null
\ansno 7. When $j = 3$, for example,
we have $k = 0$ with probability $(b + 1)/2b$; $k = 1$ with probability
$\biglp (b - 1)/2b\bigrp (1 - 1/b)$, namely the probability
that a carry occurs and that the preceding digit wasn't $b - 1$;
$k = 2$ with probability $\biglp (b - 1)/2b\bigrp (1/b)(1 - 1/b)$;
and $k = 3$ with probability $\biglp (b - 1)/2b\bigrp (1/b)(1/b)(1)$.
For fixed $k$ we may add the probabilities as $j$ varies from
1 to $n$; this gives the mean number of times the carry propagates
back $k$ places,
$$m↓k = {b - 1\over 2b↑k}\left( (n + 1 - k)\left(1
- {1\over b}\right) + {1\over b}\right).$$
As a check, we find that the average number of
carries is
$$m↓1 + 2m↓2 + \cdots + nm↓n = {1\over 2} \biggglp
n - {1\over b - 1}\bigglp 1 - \left(1\over b\right)↑n\,\biggrp\bigggrp,$$
in agreement with (6).
\def\mixanseight #1. #2{\annskip\vbox{
\halign{\hbox to 19pt{##}⊗\lft{\tt##}\quad⊗\lft{\tt##}\quad⊗\lft{\tt##}\quad
⊗$\ctr{##}$\hskip40pt⊗\lft{\tt##}\quad⊗\lft{\tt##}\quad⊗\lft{\tt##}\quad⊗$\ctr{##}
\quad$⊗##\hfill\cr
{\hfill\bf #1. }#2}}}
\mixanseight 8. {⊗⊗ENN1⊗N⊗1⊗2H⊗LDA⊗W+N+1,2⊗K\cr
⊗⊗JOV⊗OFLO⊗1⊗⊗INCA⊗1⊗K\cr
⊗⊗STZ⊗W⊗1⊗⊗STA⊗W+N+1,2⊗K\cr
⊗1H⊗LDA⊗U+N+1,1⊗N⊗⊗DEC2⊗1⊗K\cr
⊗⊗ADD⊗V+N+1,1⊗N⊗⊗JOV⊗2B⊗K\cr
⊗⊗STA⊗W+N+1,1⊗N⊗3H⊗INC2⊗1⊗N\cr
⊗⊗JNOV⊗3F⊗N⊗⊗J2N⊗1B⊗N\cr
⊗⊗ENT2⊗-1,1⊗L⊗⊗⊗⊗⊗\blackslug\cr}
\yyskip\noindent The running time depends on $L$, the number of positions
in which $u↓j + v↓j ≥ b$; and on $K$, the total number of carries.
It is not difficult to see that $K$ is the same quantity that appears in
Program↔A\null. The analysis
in the text shows that $L$ has the average value $N\biglp (b
- 1)/2b\bigrp $, and $K$ has the average value ${1\over 2}(N
- b↑{-1} - b↑{-2} - \cdots - b↑{-n})$. So if we ignore terms
of order $1/b$, the running time is $9N + L + 7K + 3 \approx
13N + 3$ cycles.
{\sl Note:} Since a carry occurs almost
half of the time, it would be more efficient to delay storing
the result by one step. This leads to a somewhat longer program
whose running time is approximately
$12N + 5$ cycles, based on the somewhat more detailed information
calculated in exercise↔7.
\ansno 9. Replace ``$b$'' by ``$b↓{n-j}$'' everywhere in step
A2.
\ansno 10. If lines 06 and 07 were interchanged, we would almost
always have overflow, but register↔A might have a negative value
at line 08, so this would not work. If the instructions on lines
05 and 06 were interchanged, the sequence of overflows occurring
in the program would be slightly different in some cases, but
the program would still be right.
\ansno 11. (a)\9 Set $j
← 1$;\xskip (b) if $u↓j < v↓j$, terminate [$u < v$]; if $u↓j = v↓j$
and $j = n$, terminate [$u = v$]; if $u↓j = v↓j$ and $j < n$,
set $j ← j + 1$ and repeat (b); if $u↓j > v↓j$, terminate
[$u > v$]. This algorithm tends to be quite fast, since there
is usually low probability that $j$ will have to get very high
before we encounter a case with $u↓j ≠ v↓j$.
\ansno 12. Use Algorithm↔S with $u↓j = 0$ and $v↓j = w↓j$. Another
``borrow'' will occur at the end of the algorithm; this
time it should be ignored.
\mixanseight 13. {⊗⊗ENTX⊗N⊗1⊗⊗ADD⊗CARRY⊗N\cr
⊗⊗JOV⊗OFLO⊗1⊗⊗JNOV⊗*+2⊗N\cr
⊗⊗ENTX⊗0⊗1⊗⊗INCX⊗1⊗K\cr
⊗2H⊗STX⊗CARRY⊗N⊗⊗STA⊗W,1⊗N\cr
⊗⊗LDA⊗U,1⊗N⊗⊗DEC1⊗1⊗N\cr
⊗⊗MUL⊗V⊗N⊗⊗J1P⊗2B⊗N\cr
⊗⊗SLC⊗5⊗N⊗⊗STX⊗W⊗1⊗\blackslug\cr}
\yyskip\noindent The running time is $23N + K + 5$ cycles, and $K$ is
roughly ${1\over 2}N$.
\ansno 14. The key inductive assertion is the one that should
be valid at the beginning of step M4; all others are readily
filled in from this one, which is as follows: ``$1 ≤ i ≤ n$;
$1 ≤ j ≤ m$; $0 ≤ u↓r < b$ for $1 ≤ r ≤ n$; $0 ≤ v↓r < b$ for $1
≤ r ≤ m$; $0 ≤ w↓r < b$ for $j < r ≤ m + n$; $0 ≤ k < b$; and
$$(w↓{j+1} \ldotsm w↓{m+n})↓b + kb↑{m+n-i-j} = u \times (v↓{j+1}
\ldotsm v↓m)↓b + (u↓{i+1} \ldotsm u↓n)↓b \times v↓jb↑{m-j}.''$$
(For the precise meaning of this notation, see the answer to exercise↔4.)
\ansno 15. The error is nonnegative and less than $(n - 2)b↑{-n-1}$.
$\biglp$Similarly, if we ignore the products with $i + j > n
+ 3$, the error is bounded by $(n - 3)b↑{-n-2}$, etc.; but,
in some cases, we must compute all of the products if we want
to get the true rounded result.$\bigrp$
\ansalgstep 16. S1. Set $r ← 0$, $j ← 1$.
\ansalgstep{} S2. Set $w↓j ← \lfloor (rb + u↓j)/v\rfloor$,
$r ← (rb + u↓j)\mod v$.
\ansalgstep{} S3. Increase $j$ by 1, and return to S2
if $j ≤ n$.\quad\blackslug\lower 5pt\null
\ansno 17. $u/v > u↓0b↑n/(v↓1 + 1)b↑{n-1} = b\biglp 1 - 1/(v↓1
+ 1)\bigrp > b\biglp 1 - 1/(b/2)\bigrp = b - 2$.
\ansno 18. $(u↓0b + u↓1)/(v↓1 + 1) ≤ u/(v↓1 + 1)b↑{n-1} < u/v$.
\ansno 19. $u - \A qv ≤ u - \A qv↓1b↑{n-1} - \A qv↓2b↑{n-2}
=u↓2b↑{n-2}+\cdots+u↓n+\A rb↑{n-1}-\A q
v↓2b↑{n-2} < b↑{n-2}(u↓2 + 1 + \A rb - \A q
v↓2) ≤ 0$. Since $u - \A qv < 0$, $q < \A q$.
\ansno 20. If $q ≤ \A q - 2$, then $u < (\A q - 1)v
< \A q(v↓1b↑{n-1} + (v↓2 + 1)b↑{n-2}\bigrp - v < \A q
v↓1b↑{n-1} + \A qv↓2b↑{n-2} + b↑{n-1} - v ≤ \A qv↓1b↑{n-1}
+ (b\A r + u↓2)b↑{n-2} + b↑{n-1} - v = u↓0b↑n + u↓1b↑{n-1}
+ u↓2b↑{n-2} + b↑{n-1} - v ≤ u↓0b↑n + u↓1b↑{n-1} + u↓2b↑{n-2}
≤ u$. In other words, $u < u$, and this is a contradiction.
\def\bigsl{\hbox{:a/}}
\ansno 21. (Solution by G. K. Goyal.)\xskip The inequality $v↓2\A q≤b\A r+u↓2$
implies that $\A q≤(u↓0b↑2+u↓1b+u↓2)/(v↓1b+v↓2)≤u\bigsl\biglp(v↓1b+v↓2)b↑{n-2}
\bigrp$. Now $u \mod v =
u-qv = v(1-α)$ where $0≤α=\A q-u/v≤u\biglp 1\bigsl\biglp(v↓1b+v↓2)b↑{n-2}
\bigrp-1/v\bigrp=u(v↓3b↑{n-3}+\cdotss)\bigsl\biglp(v↓1b+v↓2)b↑{n-2}v\bigrp
<u/(v↓1bv)≤\A q/(v↓1b)≤(b-1)/(v↓1b)$, and this is at most $2/b$ since
$v↓1≥{1\over2}(b-1)$.
%folio 758 galley 7a (C) Addison-Wesley 1978 *
\ansno 22. Let $u = 4100$, $v = 588$. We first try
$\A q = \lfloor {41\over 5}\rfloor = 8$, but $8 \cdot 8
> 10(41 - 40) + 0 = 10$. Then we set $\A q = 7$, and now
we find $7 \cdot 8 < 10(41 - 35) + 0 = 60$. But 7 times 588 equals
4116, so the true quotient is $q = 6$.\xskip (Incidentally, this example
shows that Theorem↔B cannot be improved under the given hypotheses,
when $b = 10$.)
\ansno 23. Obviously $v\lfloor b/(v + 1)\rfloor < (v + 1)\lfloor
b/(v + 1)\rfloor ≤ (v + 1)b/(v + 1) = b$; also if $v ≥ \lfloor
b/2\rfloor$ we obviously have $v\lfloor b/(v + 1)\rfloor ≥ v
≥ \lfloor b/2\rfloor $. Finally, assume that $1 ≤ v < \lfloor
b/2\rfloor $. Then $v\lfloor b/(v + 1)\rfloor > v\biglp b/(v
+ 1) - 1\bigrp ≥ b/2 - 1 ≥ \lfloor b/2\rfloor - 1$, be\-cause
$v\biglp b/(v + 1) - 1\bigrp - (b/2 - 1) = (b/2 - v - 1)(v - 1)/(v
+ 1) ≥ 0$. Finally since $v\lfloor b/(v + 1)\rfloor > \lfloor
b/2\rfloor - 1$, we must have $v\lfloor b/(v + 1)\rfloor ≥ \lfloor
b/2\rfloor $.
\ansno 24. The probability is only $\log↓b 2$, not $1\over 2$.\xskip
(For example, if $b = 2↑{35}$, the probability is approximately
${1\over 35}$; this is still high enough to warrant the special
test for $d = 1$ in steps D1 and D8.)
\def\mixansfive #1. #2{\def\\{\noalign{\penalty-200}}\annskip
\halign{\hbox to 19pt{##}⊗\rt{ \it##}\quad⊗\lft{\tt##}\quad⊗\lft{\tt##}\quad
⊗\lft{\tt##}\quad⊗$\ctr{##}$⊗\quad\lft{\rm##}\cr
{\hfill\bf #1. }#2}}
\mixansfive 25. {⊗002⊗⊗ENTA⊗1⊗1\cr
⊗003⊗⊗ADD⊗V+1⊗1\cr
⊗004⊗⊗STA⊗TEMP⊗1\cr
\\⊗005⊗⊗ENTA⊗1⊗1\cr
⊗006⊗⊗JOV⊗1F⊗1⊗Jump if $v↓1 = b - 1$.\cr
\\⊗007⊗⊗ENTX⊗0⊗1\cr
⊗008⊗⊗DIV⊗TEMP⊗1⊗Otherwise compute $b/(v↓1+ 1)$.\cr
⊗009⊗⊗JOV⊗DIVBYZERO⊗1⊗Jump if $v↓1 = 0$.\cr
\\⊗010⊗1H⊗STA⊗D⊗1\cr
⊗011⊗⊗DECA⊗1⊗1\cr
\\⊗012⊗⊗JANZ⊗*+3⊗1⊗Jump if $d ≠ 1$.\cr
⊗013⊗⊗STZ⊗U⊗1 - A\cr
⊗014⊗⊗JMP⊗D2⊗1 - A\cr
\\⊗015⊗⊗ENT1⊗N⊗A⊗Multiply $v$ by $d$.\cr
⊗016⊗⊗ENTX⊗0⊗A\cr
\\⊗017⊗2H⊗STX⊗CARRY⊗AN\cr
⊗018⊗⊗LDA⊗V,1⊗AN\cr
⊗019⊗⊗MUL⊗D⊗AN\cr
⊗$\cdots$⊗⊗⊗⊗⊗(as in exercise 13)\cr
⊗026⊗⊗J1P⊗2B⊗AN\cr
\\⊗027⊗⊗ENT1⊗M+N⊗A⊗(Now $\rX = 0$.)\cr
⊗028⊗2H⊗STX⊗CARRY⊗A(M + N)⊗Multiply $u$ by $d$.\cr
⊗029⊗⊗LDA⊗U,1⊗A(M + N)\cr
⊗$\cdots$⊗⊗⊗⊗⊗(as in exercise 13)\cr
⊗037⊗⊗J1P⊗2B⊗A(M + N)\cr
⊗038⊗⊗STX⊗U⊗A⊗\quad\blackslug\lower5pt\null\cr}
\ansno 26. (See the algorithm of exercise 16.)
{\yyskip\tabskip 22pt\mixfive{\!
101⊗D8⊗LDA⊗D⊗1⊗(Remainder will be left in loca-\cr
102⊗⊗DECA⊗1⊗1⊗\qquad tions \.{U+M+1} through \.{U+M+N})\cr
103⊗⊗JAZ⊗DONE⊗1⊗Terminate if $d = 1$.\cr
\\104⊗⊗ENN1⊗N⊗A⊗$\rI1 ≡ j - n - 1$; $j ← 1$.\cr
105⊗⊗ENTA⊗0⊗A⊗$r ← 0$.\cr
\\106⊗1H⊗LDX⊗U+M+N+1,1⊗AN⊗$\rAX ← rb+u↓{m+j}$.\cr
107⊗⊗DIV⊗D⊗AN\cr
108⊗⊗STA⊗U+M+N+1,1⊗AN\cr
109⊗⊗SLAX⊗5⊗AN⊗$r ← (rb + u↓{m+j}\mod d$.\cr
110⊗⊗INC2⊗1⊗AN⊗$j ← j + 1$.\cr
111⊗⊗J2N⊗1B⊗AN⊗Repeat for $1 ≤ j ≤ n$.\quad\blackslug\cr}}
\yyskip\noindent At this point, the division routine is complete; and
by the next exercise, register↔AX is zero.
\ansno 27. It is $du \mod dv = d(u \mod v)$.
\ansno 28. For convenience, let us assume that $v$ has a decimal
point at the left, i.e., $v = (v↓0.v↓1v↓2\ldotsm)↓b$. After
step N1 we have ${1\over 2} ≤ v < 1 + 1/b$: for
$$\eqalignno{v\,\left\lfloor b + 1\over v↓1 + 1\right\rfloor⊗≤{v(b + 1)\over v↓1
+ 1} = {v(1 + 1/b)\over (1/b)(v↓1 + 1)} < 1 + {1\over b},\cr
\noalign{\hbox{and}}
v\,\left\lfloor b + 1\over v↓1 + 1\right\rfloor ⊗≥ {v(b+1 - v↓1)\over v↓1
+ 1} ≥ {1\over b} {v↓1(b + 1 - v↓1)\over v↓1 + 1}.\cr}$$
The latter quantity takes its smallest value when $v↓1
= 1$, since it is a convex function and the other extreme value
is greater.
The formula in step N2 may be written
$$v ← \left\lfloor b(b + 1)\over v↓1 + 1\right\rfloor{v\over b},$$
so we see as above that $v$ will never become $≥1 + 1/b$.
The minimum value of $v$ after one iteration of
step N2 is $≥$
$$\eqalign{\left(b(b + 1) - v↓1\over v↓1 + 1\right){v\over
b} ≥ \left(b(b + 1) - v↓1\over v↓1 + 1\right){v↓1\over b↑2}⊗
= \left(b(b + 1) + 1 - t\over t\right)\left(t - 1\over
b↑2\right)\cr
\noalign{\vskip3pt}
⊗= 1 + {1\over b}+{2\over b↑2} - {1\over
b↑2}\left(t + {b(b + 1) + 1\over t}\right),\cr}$$
if $t = v↓1 + 1$. The minimum of this quantity occurs
for $t = b/2 + 1$; a lower bound is $1 - 3/2b$. Hence $v↓1 ≥
b - 2$, after one iteration of step N2. Finally, we have $(1
- 3/2b)(1+1/b)↑2 > 1$, when $b ≥ 5$, so at most two more
iterations are needed. The assertion is easily verified when
$b < 5$.
\ansno 29. True, since $(u↓j \ldotsm u↓{j+n})↓b < v$.
\ansno 30. In Algorithms A and S\null, such overlap is possible if
the algorithms are rewritten slightly; e.g., in Algorithm↔A\null,
we could rewrite step A2 thus: ``Set $t ← u↓j + v↓j + k$, $w↓j
← t \mod b$, $k ← \lfloor t/b\rfloor $.''
In Algorithm M\null, $v↓j$ may be in the same location
as $w↓j$. In Algorithm↔D\null, it is most convenient (as in
Program↔D\null, exercise↔26) to let $r↓1 \ldotsm r↓n$ be the same as $u↓{m+1}
\ldotsm u↓{m+n}$; and we can also have $q↓0 \ldotsm q↓m$ the same
as $u↓0 \ldotsm u↓m$, provided that no alteration of $u↓j$ is
made in step D6.\xskip (Line 098 of Program↔D can safely be changed to ``\.{J1P
2B}'', since $u↓j$ isn't used in the subsequent calculation.)
\ansno 31. Consider the situation
of Fig.↔6 with $u = (u↓ju↓{j+1} \ldotsm u↓{j+n})↓3$ as in Algorithm↔D\null.
If the leading nonzero digits of $u$ and $v$ have the same
sign, set $r ← u - v$, $q ← 1$; otherwise set $r ← u + v$, $q ←
-1$. Now if $|r| > |u|$, or if $|r| = |u|$ and the first nonzero
digit of $u↓{j+n+1} \ldotsm u↓{m+n}$ has the same sign as the
first nonzero digit of $r$, set $q ← 0$; otherwise set $u↓j
\ldotsm u↓{j+n}$ equal to the digits of $r$.
\ansno 36. Values to 1000 decimal and 1100 octal places have
been computed by R. P. Brent, Comp.\ Centre Tech.\ Rep.\ 47 (Canberra: Australian
Nat.\ Univ., 1975).
\ansno 37. Let $d=2↑e$ so that $b>dv↓1≥b/2$. Instead of normalizing $u$ and $v$ in
step D1, simply compute the two leading digits $v↓1↑\prime v↓2↑\prime$ of
$2↑e(v↓1v↓2v↓3)↓b$ by shifting left $e$ bits. In step D3, use $(v↓1↑\prime,
v↓2↑\prime)$ instead of $(v↓1,v↓2)$ and $(u↓j↑\prime,u↓{j+1}↑\prime,
u↓{j+2}↑\prime)$ instead of $(u↓j,u↓{j+1},u↓{j+2})$, where the digits
$u↓j↑\prime u↓{j+1}↑\prime u↓{j+2}↑\prime$ are obtained from $(u↓j
u↓{j+1}u↓{j+2}u↓{j+3})↓b$ by shifting left $e$ bits. Omit division by $d$ in
step D8.\xskip (In essence, $u$ and $v$ are being ``virtually'' shifted.
This method saves computation when $m$ is small
compared to $n$.)
\eject % assumes that page break at this point works OK, so file v2ans1 not too long